Introduction

Somatic tumorigenesis involves loss-of-function events affecting a heterogeneous class of regulators designated tumor suppressor genes (TSGs), which are sometimes also termed recessive oncogenes to distinguish them from their dominant transforming (gain-of-function, proto-oncogene; PO) counterparts.1 Two functional TSG subgroups are hence defined: gatekeeper genes (GKs) such as RB1 and TP53, which control cell cycling and cell death, and caretaker genes (CTs) such as BRCA1 and MLH1, which maintain DNA repair and genome stability.2 Loss-of-function mutations often affect GKs such as TP53 in sporadic tumors,3, 4 consistent with an important role for GKs in regulating clonal tumor proliferation. In contrast, we have shown in earlier work that CT dysfunction, which could be subtle, polymorphic and/or multiallelic,5 is better tolerated in the germline than GK dysfunction is, implicating environmental selection for germline CT gene methylation as a mechanism that may contribute both to maintenance of genome plasticity for the species and to heritable but non-familial cancer predisposition.6

The frequent occurrence in sporadic tumors of TSG protein overexpression without gene mutation, with tumor-promoting effects, as has been reported for TP53,7 CDKN2A,8 RB1,9 SMAD410 and CDH1,11 among others, is less well understood.12, 13 As confirmed by literature data mining,14 such overexpression is more strongly associated with GKs than with CTs (P<0.001; Supplementary Table S1); this suggests that GKs express a selectable phenotype that is lacking in CTs. Relevant to this hypothesis, we and others have shown that certain POs (such as MYC and HER2) can promote anti-oncogenic cell phenotypes such as differentiation,15 growth arrest16 or apoptosis in cells with relatively normal cell cycle control machinery;17 in contrast, when selected for ‘cooperative’ GK knockout or preexisting constitutive PO expression (for example, of the RAS family18), overexpression of heterologous POs causes growth and/or transformation.19, 20, 21, 22 These observations suggest that a dynamic balance exists between expression thresholds of opposing genes within the same network,23 and that abnormal perturbation of one part of the network could select for normal GK hyperfunction.24

Such apparent bifunctionality may be explained by epistasis, a concept which holds that the function of a given gene is not invariant but rather depends upon the genetic background.25 In the above example, a possible explanation is that GK protein overexpression may be not only tolerated but also selected in the presence of a block affecting heterologous TSGs.26, 27 Consistent with this, CT inactivation is a common tumorigenetic event predisposing to GK mutation;28 for example, repair deficiency (such as microsatellite instability caused by mismatch repair gene inactivation) may lead to secondary mutations (such as frameshifts) affecting GKs or POs in preinvasive tissues.29, 30 Moreover, as repair genes are involved in afferent sensing of DNA damage,31, 32 CT inactivation in tumors could also bias GK function towards a pro-mitogenic phenotype by raising the apoptotic threshold.33

Systems biology is an emerging area of interdisciplinary research that seeks to elucidate complex biological interactions by integrating otherwise disparate data sources; it is not a single well-defined field, but a dynamic set of experimental approaches that seek to clarify central functional patterns within systems.34 Such an approach is increasingly used where the description of a complex set of discrete observations fails to explain the behavior of the whole system, as in the example above of suppressor gene overexpression in sporadic cancer. The techniques often used in systems biology include transcriptomics, genomics, methylomics, metabolomics and similar structure–function analytical techniques.

The present study applies the systems biology approach by using a variety of such analyses for each of the four gene groups, including gene expression (transcription) analysis, gene structure (genomic) analysis, methylation-dependent ‘signature’ (methylation and mutation) analysis and evolutionary rate (evolvability) analysis. To test the notion that GKs and POs may share functional attributes depending upon the genetic environment—that is, with wild-type alleles of both the gene families maintaining differentiation clonal outgrowth in cells with heterologous suppressor gene defects—we compare and distinguish the structural and functional features of GKs, CTs, POs and a control group of genes implicated in the pathogenesis of congenital heart disease (HD). As detailed here, this analysis provides the counter-intuitive conclusion that GKs share the genomic and evolutionary properties of POs, raising the possibility of inapparent phenotypic similarities relevant to carcinogenesis.

Materials and methods

Gene identification, classification and ontology analysis

We chose 157 genes (39 CTs, 36 GKs, 41 POs, 41 HDs; Supplementary Tables S1 and S5) from published works (http://www.sanger.ac.uk/genetics/CGP/Census).35, 36 Familial cancer genes with repair functions37, 38 were categorized as CTs, whereas other familial suppressors39—most of which were confirmed to mediate apoptosis40—were classified as GKs. In addition, only putative POs with transforming viral homologs were classed as POs. Given the multigenic interdependence of DNA repair and cellular apoptosis,33 unambiguous identification of genes that exclusively mediate one of these two processes is not straightforward. We sought to minimize uncertainties over this functional overlap in two ways: first, we used a familial tumor suppressor gene database35, 41 to restrict the choice of genes to those with major neoplastic effects (that is, heritable cancer syndromes) when deleted in the germline; this provided a total of 75 TSGs; and second, by cross-correlating the former data set with a database of DNA repair genes38 we subclassified this familial cancer susceptibility gene subset as CTs (n=39), then designated the remainder—the majority of which were confirmed to mediate apoptosis40—as GKs (n=36). An additional control group of heart development genes implicated in congenital HDs was also defined using the NCBI Entrez Gene (27 September 2009 release) and keyword search on ‘congenital heart disease’ and ‘Homo sapiens’. Gene ontology analysis42 was performed using the Panther database (http://www.pantherdb.org/).43

Analyses of gene sequences and mutations, and non-negative matrix factorization

We used sequence analysis to infer functional properties of gene sequences as in our previously cited publications. Briefly, enzymatic methylation of cytosines in CpG dinucleotides clustered within gene promoters leads to transcriptional repression and chromatin condensation, whereas methylcytosine residues in coding regions may undergo oxidative deamination to form thymine residues; if the mismatch repair system fails to rectify these mutations, an excess of CG → TA transitional mutations becomes a quantifiable hallmark of the foregoing methylation events.44 This interaction between methylation-dependent trans repression and mutation is in turn a factor of positive selection,45 perhaps contributing to the nonrandom correlations between codon structure and function that we have previously reported.46, 47 Inter-species changes in genomic GC content48 could also derive, in part, from such a mutational mechanism. Nucleotide sequences may similarly be examined for the presence or absence of strand-specific (sense vs antisense) dinucleotide asymmetries,49 whereas transcription-coupled repair creates asymmetric patterns of base composition that can be mined to support retrospective inferences of differential transcription.50 Human and mouse reference sequences and species gene numbers were downloaded from NCBI Gene (http://www.ncbi.nlm.nih.gov/Gene), whereas mutation data were downloaded from the Human Gene Mutation Database. A variety of packages from R 2.8.1 (http://www.r-project.org) were used for statistical analysis, including coin, biomaRt, GeneR, nlmc and others. To analyze coding sequences, we used reference data from NCBI Entrez Gene or ENSEMBL, updated using R scripts. For multiple splicing forms, the longest coding sequence was used for analysis; mono- and dinucleotide composition was assessed using in-house Perl scripts and/or the GeneR package in R 2.8.1.
Comparison of mutation rates in germline (that is, familial cancers) and somatic (sporadic) tumors was based on the Cancer Genome Project Cancer Gene Census list (http://www.sanger.ac.uk/genetics/CGP/Census).35, 36 Frame-dependent dinucleotide composition and asymmetries were analyzed using the GeneR package. For comparative analyses of 5′- and 3′-untranslated regions (UTRs), reference sequences were downloaded from ENSEMBL (Release 52) using the R package biomaRt. Non-negative matrix factorization (NNMF), principal component analyses and non-parametric tests were performed using MATLAB (7.6) Statistical Toolbox (5.1) (http://www.mathworks.com).
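The dinucleotide measures described above can be illustrated with a minimal Python sketch. This is not the in-house Perl or GeneR code used in the study; the function names (`dinucleotide_counts`, `cpg_obs_exp`), the toy sequence and the observed/expected CpG ratio are assumptions introduced here for illustration only.

```python
from collections import Counter


def dinucleotide_counts(seq, frame=None):
    """Count overlapping dinucleotides in a DNA sequence.

    frame=None counts every position; frame=0, 1 or 2 keeps only
    dinucleotides that start at that codon position (0 = first base).
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 1):
        if frame is not None and i % 3 != frame:
            continue
        pair = seq[i:i + 2]
        if set(pair) <= set("ACGT"):  # skip ambiguous bases such as N
            counts[pair] += 1
    return counts


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (n_CG * length) / (n_C * n_G).

    This ratio is an illustrative assumption, not a statistic reported
    in the text.
    """
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    return dinucleotide_counts(seq)["CG"] * len(seq) / (n_c * n_g)


cds = "ATGCGCACGTGA"  # toy coding sequence, not a real gene
all_pairs = dinucleotide_counts(cds)             # frame-independent counts
frame1_pairs = dinucleotide_counts(cds, frame=1)  # pairs starting at codon position 2
```

Comparing the CpG, TpG and CpA entries of such counters between gene groups, per frame, is the kind of tabulation that frame-dependent analysis relies on.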

Analysis of evolutionary rate

The latest gene evolutionary rate data were downloaded from ENSEMBL (release 52) using the biomaRt package running on the R platform. Evolutionary rates were computed by the maximum likelihood method using the PAML package. Gene expression intensity and breadth (specificity) were related to molecular evolutionary rates as approximated by the ratio of non-synonymous to synonymous mutation rate (Ka/Ks or dN/dS, in which the former is treated as non-neutral whereas the latter is treated as neutral).51
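As a minimal sketch of how the dN/dS ratio is read, the following Python fragment computes the ratio and labels it under the usual convention (below 1 suggests purifying selection, near 1 neutrality, above 1 positive selection). The function names, the tolerance `tol` and the example values are hypothetical; the study itself obtained dN and dS from ENSEMBL and PAML rather than from code like this.

```python
def dn_ds_ratio(dn, ds):
    """Return omega = dN/dS, or None when dS is zero (ratio undefined)."""
    if ds == 0:
        return None
    return dn / ds


def selection_regime(omega, tol=0.05):
    """Label a dN/dS ratio; tol is an arbitrary illustrative band around 1."""
    if omega is None:
        return "undefined"
    if omega < 1 - tol:
        return "purifying"
    if omega > 1 + tol:
        return "positive"
    return "neutral"


# Hypothetical substitution rates for a single gene (not study data).
omega = dn_ds_ratio(0.02, 0.40)
label = selection_regime(omega)
```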

Gene expression analysis

We mined reference data from the University of California Santa Cruz genome database, then analyzed these data using an unsupervised Euclidean distance hierarchical cluster method.52 The aim of such cluster analysis is to measure similarities between different data points, which is difficult to achieve visually even for a relatively small data set of 157 genes; we therefore represented the four gene groups in n-dimensional space, where n is the number of tissue classes, in which context Euclidean distance is simple to compute. The clustergrams so derived are heatmaps annotated with dendrograms from hierarchical clustering of the matrix data, including column dendrograms; the rows represent genes and the columns represent samples. Default clusters are created by average linkage with the Euclidean distance metric, whereas hierarchical cluster trees are created using a single-linkage algorithm in which the input Y is a distance matrix. Similar results were generated using other methods. In addition, to quantify the transcriptomic similarity of GKs and POs, and their distinction from CTs and HDs, we conducted a multinomial logistic regression analysis. As our interest focuses on the four modules (CTs, GKs, POs, HDs), we used the median value of the given categories for analysis. Fisher's exact test was used to compare the modules.

To address the problem of computing the P-value for a cluster, we also used an R package, ‘pvclust’ (www.is.titech.ac.jp/%7Eshimo/prog/pvclust/), which is suitable for computing unbiased P-values. We downloaded the relevant gene reference expression data, combined from both human and mouse genomes, using the University of California Santa Cruz gene sorter (GNF Atlas 2—GNF Expression Atlas 2 Data from U133A and GNF1H Chips), analyzed the median value for each group of genes, then conducted hierarchical cluster analysis with multiscale bootstrap (number of bootstrap=10 000 simulations) using an average method and correlation-based dissimilarity matrix.
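The average-linkage, Euclidean-distance clustering described in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the MATLAB clustergram or pvclust workflow used in the study: the function names, the group labels A to D and the expression values are invented, and no bootstrap P-values are computed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles maps a label to a list of expression values. Returns the merge
    history as (members_a, members_b, distance) tuples, where distance is the
    mean pairwise Euclidean distance between the two merged clusters.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented median expression values for four groups across three tissues.
medians = {
    "A": [1.0, 1.2, 0.9],
    "B": [3.0, 3.1, 2.8],
    "C": [3.2, 3.0, 2.9],
    "D": [0.2, 5.0, 0.1],
}
history = average_linkage(medians)
```

In practice the dictionary would hold one median profile per gene group and per tissue, and the first entries of `history` show which groups join earliest in the dendrogram.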

Results

Gene subgroup identification

Three classes of ‘cancer genes’ (CTs, GKs and POs) were defined by hereditary cancer syndromes (CTs and GKs) and/or homologous viral oncogenesis (POs). The genes so identified are listed in Supplementary Table S2. An additional control group of 41 functionally unrelated cardiac development genes implicated in congenital HDs was also defined (see Materials and methods).

Functional gene classification and mutation frequency

We first determined the biological processes associated with CTs, GKs, POs and HDs using Panther, and established that the functions of POs and GKs overlap, but that this is not true for CTs or HDs: the top four processes implicated for GKs and POs (oncogenesis, cell cycle control, cell proliferation and differentiation, and protein phosphorylation) are common to both groups, whereas the top five processes for CTs (DNA repair, DNA metabolism, nucleoside metabolism, meiosis and DNA recombination) are exclusive to that group (Supplementary Table S3). A similar molecular function comparison confirms that the top five functional strings (related to kinases, receptors and transcription factors) associated with GKs are shared by POs, whereas CTs and HDs share only one function with GKs and POs (Supplementary Table S4).

A further comparison of ‘cancer gene’ mutation rates in the germline (that is, predisposing to familial cancers) and in somatic (sporadic) tumors was then performed on CTs, GKs and POs, showing that there are 34 genes with both germline and somatic mutations, 38 genes with germline mutations only and 310 genes with somatic mutations only; germline-only mutations often involve DNA repair (‘mutation modifier’) genes, somatic-only mutations more typically affect POs, whereas GKs may be affected in either context (Supplementary Figure S1). These data confirm a clear soma-germline difference in gene mutation profile: tumor-permissive repair gene (including CT) dysfunction is selectively tolerated in the germline, whereas other tumorigenic mutations (that is, including GK and PO) are more common in somatic tissues (χ2-test, P<2.2e−16). The conclusion that germline GK and PO losses are more often embryonic lethal than CT losses again supports a shared pivotal role for these genes in enhancing survival.

Gene length and GC content

Like the height and weight of animals, the length and GC content of coding sequences are fundamental features of genes, with implications for transcription and methylation frequency53, 54, 55 that in turn reflect interactions between stochastic genetic determinants and fluctuating environments.56, 57 Figure 1a shows that the coding sequence lengths of GKs and POs are similar to each other but significantly shorter than those of CTs, whereas GC and GC3 contents (GC content within the third site of codons) are significantly higher in GKs and POs than in CTs (Figures 1b and c), as revealed by a multiple logistic model (P=0.0301, P=0.0003 and P=0.0001 for length, GC and GC3s, respectively). Gene length may thus be decreasing, whereas GC and GC3 are increasing, from CTs to GKs/POs (Kendall trend analysis: τ=−0.14, two-sided P=0.0540 for length; τ=0.287, two-sided P=7.605 × 10^−5 for GC; τ=0.327, two-sided P=6.681 × 10^−6 for GC3). A simple length vs GC3 plot illustrates the structural similarity of GKs and POs, and their distinction from CTs (Figure 1d). We also conducted post hoc non-parametric Nemenyi–Damico–Wolfe–Dunn testing (Supplementary Table S5), including pairwise comparisons (Supplementary Table S6), confirming that GKs resemble POs in these key aspects.
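For readers unfamiliar with these measures, a minimal Python sketch of GC and GC3 content follows. It assumes an in-frame coding sequence and counts every third codon position; the function names and the toy sequence are illustrative, and tools that report GC3s may exclude non-degenerate codons, which this sketch does not do.

```python
def gc_content(seq):
    """Fraction of G or C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc3_content(cds):
    """GC fraction at third codon positions of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    third_positions = cds[2:usable:3]
    return gc_content(third_positions)


cds = "ATGGCCAAGTAA"  # toy sequence: codons ATG GCC AAG TAA
length = len(cds)
gc = gc_content(cds)
gc3 = gc3_content(cds)
```

Computing `length`, `gc` and `gc3` for the longest coding sequence of each gene yields the three per-gene variables that the group comparisons above operate on.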

Figure 1

Coding sequence features of caretakers (CTs), gatekeepers (GKs) and proto-oncogenes (POs). (a–c) Boxplot diagrams of gene length, GC and GC3s, respectively. (d) Dot plot of gene length and GC3s, illustrating the structural similarities of GKs and POs, and their distinction from CTs.

Dinucleotide content and tissue-specific gene expression patterns

More specific differences in methylation-dependent dinucleotide content between GKs, POs and CTs are summarized in Table 1. Frame-dependent dinucleotide alignment and evolutionary analysis have shown that CpG sites in GKs and POs fix missense mutations more often than do those in CTs, and that similar trends apply to whole-genome comparison of apoptosis vs repair genes.6 As the minimum sequence motif is a dinucleotide, we hypothesized that nonrandom alterations in dinucleotide patterns—including asymmetries due to transcription-coupled repair—may be selected from transgenerational variations, the frequency of which varies with germline gene methylation.58 With this in mind, directional analyses of methylation-dependent dinucleotides in the target gene sets reveal an asymmetry: not only are GKs and POs characterized by a higher CpG content than CTs but also by a selectively lower CpA—but not TpG—content, suggesting greater transcription-coupled repair of methylation-dependent mutation in GKs and POs than in CTs.53 This functional conclusion is supported by cross-species gene expression data indicating higher and more tissue-matched transcription frequencies of GKs and POs compared with CTs, with no similarity of either group to control HDs (P<0.03; Figure 2). We also used the R-package ‘pvclust’ to compute the corresponding P-values, based on both the approximately unbiased and bootstrap probability (BP), as shown in Supplementary Figures S5A and S5B. We submit that the topology depth difference in this expression data set supports our conclusion.

Table 1 Non-parametric comparison of frame-dependent dinucleotide component of CTs, GKs, and POs
Figure 2

Gene expression profiles of caretakers (CTs), gatekeepers (GKs), proto-oncogenes (POs) and heart disease genes (HDs) in human and mouse. The original data were downloaded from the University of California Santa Cruz genome site (http://genome.ucsc.edu). A hierarchical cluster analysis of gene data from gnfHumanAtlas2 and gnfMouseAtlas2 (Su AI et al.52) was used for classification with the MATLAB clustergram function, computing median values for the respective tissues. We compared the patterns seen in various tissues in the four categories using non-parametric methods. Heatmap data were scaled from −1 to +1 for Euclidean distance. This analysis confirms that gene expression patterns differ significantly between CTs, HDs and GKs/POs, both in mouse and in human (P<0.025).

Untranslated region, promoter region and flanking region analyses

Functions of the 5′-and 3′-UTRs include transcriptional regulation, mRNA stability and translational efficiency.59 These UTRs remain under evolutionary selection pressure,60 in turn confirming that conserved sequence homologies reflect associated selectable phenotypes.61 Consistent with this, UTR sequences may modify TSG function through either antisense62 or translational inhibition mechanisms,63 and may directly regulate cell growth and death.64, 65 For these reasons, we compared UTR length and dinucleotide composition in the three defined classes of cancer genes. As shown in Supplementary Figure S2, both the 5′-and 3′-UTRs of CTs are lower in GC content than are those of GKs and POs (multiple Behrens–Fisher Test, CTs vs GKs, P<0.05, CTs vs POs, P<0.001, GKs vs POs, P>0.35). Similarly, both the 5′-and 3′-UTRs of CTs are shorter in length than those of GKs and POs (lower panels, multiple Behrens–Fisher Test, CTs vs GKs, P<0.025, CTs vs POs, P<0.001, GKs vs POs, P>0.30); this finding again suggests an underlying functional similarity between POs and GKs, given that genes functioning in growth and apoptosis tend to be characterized by longer structured 5′-UTRs.65 To compare the methylation-dependent dinucleotide composition in 5′-UTR and 3′-UTR, we analyzed the distribution of relevant dinucleotides. As shown in Supplementary Figures S3A and S3B, the distributions of CpG, TpG and CpA reveal significant differences (with P<0.001 cutoff) between CTs and GKs/POs with respect to 5′-UTR CpG and TpG, but not to CpA or 3′-UTR, suggesting that these differences may arise, in part, due to variations in transcription-coupled DNA repair. With respect to flanking sequences, all parameters are significant between CTs and GKs/POs for 5′ 1 kb flanking sequences, but for 3′ 1 kb flanking sequence only TpG and CpA are significant, consistent with 3′ attenuation of transcription shown in our earlier report.

Non-negative matrix factorization

To assess this structural similarity between GKs and POs, we next proceeded beyond multiple pairwise comparisons. The term non-negative matrix factorization (NNMF) refers to a multivariate method in which a data matrix is factorized into two matrices under the constraint that all elements must be non-negative (≥0), in contrast to approaches such as principal component analysis, which impose different constraints.66, 67, 68 We therefore analyzed the coding sequence pattern with an NNMF method, which is well suited to non-negative variables, in addition to the linear normalization and transform provided by principal component analysis. We used NNMF to ask the question: given the dinucleotide composition of the 116 putative cancer genes, how many meta-genes and meta-dinucleotides have an informational content similar to the data matrix therein? As all the data were non-negative, we tested whether the three modules (CTs, GKs, POs) were differentially composed of methylation-sensitive (for example, CG) and non-sensitive (for example, AA, TT) dinucleotides for all of the eight-dimensional data including 5′-UTR, 3′-UTR, 5′ 1 kb flanking, 3′ 1 kb flanking, total dinucleotide, frame-specific and dinucleotides. Figure 3 shows the close similarity of GK and PO meta-genes, but also shows the clear difference of CTs.
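The idea behind NNMF can be sketched in plain Python using multiplicative updates, which keep every element of both factors non-negative. This is an illustrative substitute for, and not necessarily the same algorithm as, the MATLAB toolbox routine used in the study; the helper names, the toy matrix and the iteration count are assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(O, rank, iters=500, seed=0, eps=1e-9):
    """Approximate non-negative O (n x m) as W (n x rank) times M (rank x m)."""
    rng = random.Random(seed)
    n, m = len(O), len(O[0])
    W = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    M = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # Update M: M <- M * (W^T O) / (W^T W M), element by element.
        Wt = transpose(W)
        num, den = matmul(Wt, O), matmul(matmul(Wt, W), M)
        M = [[M[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # Update W: W <- W * (O M^T) / (W M M^T), element by element.
        Mt = transpose(M)
        num, den = matmul(O, Mt), matmul(W, matmul(M, Mt))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, M


def frobenius_error(O, W, M):
    """Frobenius norm of the residual O - W M."""
    approx = matmul(W, M)
    return sum((o - a) ** 2
               for row_o, row_a in zip(O, approx)
               for o, a in zip(row_o, row_a)) ** 0.5


# Toy 4 x 3 non-negative matrix (rows could be genes, columns dinucleotides).
O = [[2, 0, 1], [4, 0, 2], [0, 3, 3], [0, 1, 1]]
W, M = nmf(O, rank=2)
error = frobenius_error(O, W, M)
```

Plotting the first two columns of `W` for each gene, colored by group, corresponds to the kind of two-dimensional view used to compare meta-gene loadings.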

Figure 3

Non-negative matrix factorization analysis of dinucleotide pattern analysis, performed in MATLAB 7.6 platform. Green, yellow and red color dots represent caretakers (CTs), gatekeepers (GKs) and proto-oncogenes (POs), respectively. The data were computed with Statistical toolbox 5.1 using default non-negative matrix factorization (NNMF) parameters. The x and y axes correspond to the first and second columns of matrix W (W1 and W2), which were in turn computed by the NNMF algorithm, O=W × M, where O is the object n by m matrix, W is n by x and M is x by m. Results were derived using MATLAB statistical toolbox NNMF functions.

Molecular evolutionary rate analysis

Statistical analysis of gene evolutionary rates for the three modules confirms that CTs are evolving significantly faster than either GKs or POs, both of which are under strong purifying selection (non-parametric test, P<0.0001 for dN and dN/dS, but not dS; Figure 4). This conclusion is reinforced across phylogeny (Supplementary Table 7, including HD data, and Supplementary Figure 4).

Figure 4

Molecular evolutionary analysis of caretakers (CTs), gatekeepers (GKs) and proto-oncogenes (POs). Data were downloaded from ENSEMBL (version 50 release) and measured using maximum likelihood method (PAML). Mean and inter-quartile ranges are shown. Statistics were derived using non-parametric tests (P<0.0001 for dN and dN/dS).90

Discussion

Understanding the context-dependent function of so-called ‘cancer genes’ is essential for progress in both normal cell biology and rational anticancer therapeutics. Epistasis, the interacting functional effects of genes within networks, is an extension of the concept of allelic dominance, and can be quantitatively modeled using a mathematical approach.25 As ‘cancer genes’ (dominant or recessive) do not cause cancer most of the time, and many of these genes remain wild type in individual tumors, the possibility that normal growth regulatory genes of this class could be subverted to drive tumor microevolution is important to address. Another way of considering this issue involves defining the temporal sequence of genetic lesions that drive oncogenesis, and thus inferring the sequential downstream interactive effect(s) of each lesion on other as-yet-unmutated regulatory genes.69 Indeed, increasing evidence supports the importance of such epistatic processes for mammalian cancer development24, 70, 71, 72, 73 and, perhaps, also for transgenerational carcinogenesis.6, 74

The central finding of the present study is the unanticipated insight that the structural and functional attributes of GKs and POs—when compared with control subgroups of CT cancer suppressor genes and functionally unrelated but developmentally important HD genes—are strikingly similar, as indicated by congruences of evolutionary rate, expression level and breadth, gene length and methylation-dependent mutation confirmed by logistic regression and C-statistics. These data imply that GKs have undergone epigenetic evolution trajectories similar to those of POs, suggesting an explanation for the otherwise puzzling observation that wild-type GK (but not CT) expression is often selectively increased in tumors. Indeed, abundant evidence now confirms that gene behavior varies with the environment; for example, environmental DNA damage—for example, due to smoking75 or inflammation76—selects for TSG methylation in exposed target tissues, thus permitting the upregulation of heterologous wild-type prosurvival gene functions, and contributing to tumor evolution in a negatively selected manner. Consistent with this, recent work shows that the physiological ‘decision threshold’ of MYC to induce either proliferation or apoptosis77 can depend either on gene expression level23 and/or on regulatory interactions with modifier proteins.78 Conversely, clonal overexpression of POs such as HER2 in human tumors seems likely to depend upon previous inhibition of GK expression.17, 79 This epistatic phenomenon has been well described by Vogelstein and Kinzler80 who compared POs and GKs with ‘…electronic components whose effects depend on their placement within a circuit’, a view that is supported by our findings.

Two impressions emerge from the present study: first, a cautionary emphasis on the misleading nature of classifications that ascribe a single invariant function to a given gene; and second, a model that puts epistasis at the heart of carcinogenesis. In this model, upstream CT gene dysfunction (whether germline or somatic) may short circuit the afferent limb of the DNA damage response, biasing GK function towards cell survival and away from apoptosis,81 in much the same way as somatic selection for senescent GK hypofunction permits oncogenic upregulation of wild-type POs.17 Although it is well known that DNA repair defects can potentiate cellular chemosensitivity to (unrepaired) oxidative damage,82 it is less appreciated that repair deficiencies may also impair apoptosis83, 84, 85, 86—presumably through failure to sense damage—thereby, exaggerating tolerance of unrepaired damage with consequent acceleration of carcinogenesis or tumor progression due to increased genetic instability.87, 88, 89

Similar to most studies using systems biology, our conclusions are limited by their inferential and indirect nature, though we note that clinical observations of GK overexpression in human tumors are consistent with the model proposed.12 It is not our contention that GKs are the same as oncogenes, as important differences can also be identified using our approach; for example, we have documented a significantly higher frequency of gene duplicates for POs than for GKs (data not shown). Nevertheless, on the basis of the present study, we submit that GKs are dual-function genes, which share with POs a key prosurvival action that is normally subjected to purifying selection, but which may promote tumorigenesis in the abnormal epistatic context of neoplastic cells. Cancer microevolution may thus be accelerated by an age- and damage-driven cascade in which upstream TSG defects select for downstream wild-type GK overexpression. Paradoxically, in this transformed epistatic context, we caution that further upregulation of GK expression by cytotoxic DNA damage could promote tumor growth (that is, efficiently select for resistance) and thus inadvertently reduce patient survival. We therefore speculate that patients whose cancers are associated with GK upregulation may have the best survival outcome if managed with minimal cytotoxic intervention. Further studies (including, ideally, direct testing of the hypothesis using in vitro experimental systems in which wild-type GK expression is inducibly upregulated in reporter cell lines distinguished by different backgrounds of TSG dysfunction) will be needed to test our conclusion that overexpressed GKs represent both a biomarker of heterologous suppressor gene defects and a valid therapeutic target in such contexts.