Pan-cancer driver copy number alterations identified by joint expression/CNA data analysis

Analysis of large gene expression datasets from biopsies of cancer patients can identify co-expression signatures representing particular biomolecular events in cancer. Some of these signatures involve genomically co-localized genes resulting from the presence of copy number alterations (CNAs), for which analysis of the expression of the underlying genes provides valuable information about their combined role as oncogenes or tumor suppressor genes. Here we focus on the discovery and interpretation of such signatures that are present in multiple cancer types due to driver amplifications and deletions in particular regions of the genome after doing a comprehensive analysis combining both gene expression and CNA data from The Cancer Genome Atlas.

pan-cancer driver copy number alterations identified by joint expression/cnA data analysis Gaojianyong Wang 1 & Dimitris Anastassiou 1,2* Analysis of large gene expression datasets from biopsies of cancer patients can identify co-expression signatures representing particular biomolecular events in cancer. Some of these signatures involve genomically co-localized genes resulting from the presence of copy number alterations (CNAs), for which analysis of the expression of the underlying genes provides valuable information about their combined role as oncogenes or tumor suppressor genes. Here we focus on the discovery and interpretation of such signatures that are present in multiple cancer types due to driver amplifications and deletions in particular regions of the genome after doing a comprehensive analysis combining both gene expression and CNA data from The Cancer Genome Atlas.
Gene co-expression signatures in cancer often involve genomically co-localized genes resulting from the presence of various biological mechanisms that include, but are not limited to, the copy number alterations (CNAs) of malignant cells 1 and the immune response against cancer cells 2 . For example, ERBB2, GRB7, MIEN1 are among the genes co-expressed in breast cancer due to the HER2 amplicon 3 , while HLA-DPA1, HLA-DPB1, HLA-DRA are among the genes co-expressed in the MHC Class II immune cluster 4 . Any co-expression signatures that are consistently present in many different cancer types are referred to as "pan-cancer" signatures, representing universal (tissue-independent) biomolecular events in cancer [5][6][7][8] . Ref. 8 studied co-expressed genes in immune cells. There are several techniques for identifying co-expression signatures involving genomically co-localized genes [9][10][11] .
We have proposed an unsupervised algorithm to identify genome-wide co-expression signatures known as attractor metagenes 5 , a version of which was focusing on genomically co-localized signature finding. Attractor metagenes have been used successfully for cancer biomarker discovery [12][13][14] .
The identification of genomically co-localized gene signatures can shed light on some complex cancer-related biological mechanisms, especially the tumor driving events caused by CNAs. CNAs involve amplified or deleted DNA regions, which have been generated by the chromosomal instability of malignant cells. If such CNAs are frequently present in cancer cells contained in multiple cancer samples, this suggests that they have an evolutionary advantage and therefore are "driver" CNAs with the tendency to create subclones in the heterogeneous tumors. Although CNAs may include in some cases a single or few oncogenes or tumor suppressor genes, in which case their pan-cancer identification covers a small genomic region containing that gene 15,16 , they typically influence DNA regions covering many genes, implying that some of these genes have synergistic functions in tumorigenesis 17 . Here we focus on pan-cancer CNAs containing multiple genes.
The previous work on identifying pan-cancer CNAs 18,19 only made use of data resulting from analysis of the genomes in the malignant cells. However, the evolutionary advantage of CNAs is based on the expression of particular genes located within the CNA genomic region. Therefore, analysis of gene expression data provides additional valuable information [20][21][22][23][24] . Some among the list of the consistently co-expressed genes, including the first and the last when sorted in terms of their genomic location, play some role in tumorigenesis and it is possible that this role is due to their synergistic functions. More generally, gene expression analysis provides helpful information for the identification of the driver genes in each preserved CNA genomic region by pointing to those genes that are consistently amplified or deleted.
In this paper, we use a novel methodology for the identification of pan-cancer co-localized gene signatures containing no less than five strongly co-expressed genes that are due to CNAs, by making use of gene expression as well as CNA data from The Cancer Genome Atlas (TCGA). Part of this method applies a pan-cancer version of a genomically co-localized attractor algorithm, which is an extension of our previous work 5

Summary.
We applied the pan-cancer genomically co-localized attractor algorithm (Materials and Methods) to the TCGA expression data of 56,830 genes and 8593 tumor cases covering eighteen major types of cancer (Table S1), using a window size of 150 genes. This resulted in the identification of 101 pan-cancer genomically co-localized gene signatures (Table S2). To designate such signatures as being caused by driver CNAs, we reasoned that they should satisfy two conditions simultaneously: They should exhibit a high association between their corresponding levels of gene expression and CNA values, and at the same time their genomic regions should frequently appear as CNAs in multiple cancer types. 76 signatures had high expression/CNA level association (P < 0.05, Table S3, Materials and Methods). 54 signatures had high amplification or deletion frequency (Table S4, Materials and Methods). 37 genomically co-localized signatures satisfied both conditions above, and were designated as being caused by CNAs in cancer cells containing cooperative oncogenes/tumor suppressor genes (Tables S3 and S4). Among those 37 genomically co-localized signatures, 25 signatures correspond to pancancer amplifications (Table 1 and Fig. 1) and 12 signatures correspond to pan-cancer deletions ( Table 2 and Some of the identified signatures are located genomically close to each other. This suggests that each of them, by itself, has sufficient evolutionary advantage (indeed, we observed that the expression levels of adjacent genomically co-localized signatures are often independent of each other), but it is also possible for an amplicon to cover multiple such regions simultaneously (Materials and Methods, Table S5).
To provide insights of the underlying biological significance in particular examples, we analyze some of such CNAs in the following sections.
Genomically co-localized signatures associated with 1q21.3-q41 amplification. We identified signature VPS72 and signature FLAD1 located on 1q21.3 amplicon (Fig. 1A,B). The expression level of signature VPS72 is strongly associated with the expression level of signature FLAD1 ( Figure S1) and these two signatures have a co-amplification frequency of 91.6% ( Figure S2, Table S5). We also identified another genomically colocalized signature, RAB3GAP2, located on 1q41, which has not been detected as a pan-cancer amplicon 18,19 . We observed that the expression level of signature FLAD1 is not associated with the expression level of signature RAB3GAP2 ( Figure S3), although they are co-amplified in 79.8% of the cancer cases ( Figure S4, Table S5).
GSEA 86 (Gene Set Enrichment Analysis) was applied to the genes of the three signatures VPS72, FLAD1 and RAB3GAP2, concluding that these genes are enriched with the GO (Gene Ontology) term 'Mitochondrion' (P < 10 −7 , Q < 10 −3 ), thus potentially helping the efforts to shed light on the underlying biological mechanism.
Genomically co-localized signatures associated with 8q13.1-24.3 amplification. We identified three genomically co-localized signatures located on the 8q arm: ARMC1, UTP23 and SHARPIN (Fig. 1L,M,N). The expression plots between signature ARMC1 and signature UTP23 show that they are associated with each other ( Figure S5), and that they are co-amplified in 76.2% of the cancer cases ( Figure S6, Table S5). This suggests that there is a synergistic effect between them. On the other hand, the expression levels of signature UTP23 and signature SHARPIN are independent ( Figure S7) although these two signatures are co-amplified in 77.6% of the cancer cases ( Figure S8, Table S5).
Genomically co-localized signatures associated with 1p36.33-22 deletion. We identified two genomically co-localized signatures, UBE2J2 and MIIP ( Fig. 2A,B) located on 1p36.33-36.22 that have not been detected as pan-cancer deleted regions 18,19 . The expression levels of signature UBE2J2 and signature MIIP are strongly associated with each other ( Figure S9) with co-deletion frequency of 70.3% ( Figure S10, Table S5), suggesting these GLAs can either independently exist or be co-deleted. Among the genes in these two signatures, gene AURKAIP1 down-regulates the Aurora-A oncogene 67 . Gene FAAP20 is needed in DNA repair pathway 68 . The deletion of gene MIIP can induce chromosomal instability 69 . Tumor suppressor gene MAD2L2 inhibits cancer growth 70 . GSEA was applied to the genes of the two signatures UBE2J2 and signature MIIP, concluding that these genes are enriched in the GO term 'Negative Regulation of Cellular Component Organization' (P < 10 −4 , Q < 0.05), suggesting potential mechanisms associated with the evolutionary advantage of their simultaneous deletion.
Comparison with previous TCGA studies. We compared our results with the tumor driving CNAs detected in Refs. 18,19 . On the one hand, several CNAs that we identified by our joint expression/CNA analysis were missed in both of those references. On the other hand, because our algorithm was designed to detect at least five consistently strongly co-expressed genes (Materials and Methods), we do not include the "peak CNAs", as well as those CNVs containing less than five co-expressed genes (Table S6), which were obtained in Refs. 18,19 . Such peak CNAs include those containing MYC, CCND1, METTL1, NKX2-1, EGFR, FGFR1, KRAS, CCNE1, CRKL, CDKN2A, FHIT, WWOX, PTPRD, MACROD2, PRKN, LRP1B, RNA5SP174, PLK2, and RBFOX1 (Figures S11, S12, Table S6). Despite the small number of potential driver genes in peak CNAs, our algorithm can help identify the cooperative effects between those genes. For example, signature MYC ( Figure S11A) consists of genes in the neighborhood of gene MYC. Among them, the long non-coding RNA PVT-1 has the second strongest association with the signature, suggesting that PVT-1 also plays a role in tumorigenesis, consistent www.nature.com/scientificreports/ with the previous conclusion 87 that PVT-1 and MYC have cooperative effect in cancer. Furthermore, FAM84B, another gene adjacent to MYC, is the fifth top-ranked gene associated with the signature, consistent with its identified role 88 of strengthening the function of MYC. Examples of signatures containing less than five genes are those containing ATAD1 and PTEN in 10q23.31 (Table S6), THAP3, 2BTB48, PARK7 in 1p36.31 (Table S6), and STK25, ATG4B, ING5, THAP4 in 2q37.3 (Table S6). All signatures identified on the CNAs listed in Refs. 18,19 can be found in Table S6.

Discussion and conclusion
This paper focuses on detecting pan-cancer genomically co-localized gene co-expression signatures associated with amplicons or deleted regions, identifying several novel pan-cancer CNAs. Such signatures contain oncogenes or tumor suppressor genes and result from the cooperative effect of some of their member genes. We have also found that some amplified regions contain multiple genomically co-localized signatures with different tumorigenesis functions, which are occasionally amplified separately. Previous studies (Refs. 20,24 ) used the association between expression and CNA levels as part of their methods to determine whether a gene is likely to be an oncogene or a tumor suppressor gene. Therefore, many of such previously identified genes are included in our identified genomically co-localized signatures. For example, gene VPS72 and gene PSMD4 are identified as two oncogenes in Ref. 20 , and these two genes are identified as cooperative oncogenes co-expressed in signature VPS72. Gene MED21 and gene CCDC91, two oncogenes independently identified in Ref. 20 , are co-expressed in signature MED21. Genes SYNCRIP and MAP3K7, two tumor suppressor genes reported in Ref. 24 , are identified as components of signature SYNCRIP in this paper. Similarly, tumor suppressor signature CCAR contains three co-expressed tumor suppressor genes, CHMP7, CCDC25, and INTS9, which were identified as independent tumor suppressor genes in Ref. 24 . Our analysis not only indicates that genes may be oncogenes or tumor suppressor genes, but also suggests that the co-expressed genes in a genomically co-localized signature have cooperative effects in tumorigenesis due to their simultaneous amplification or deletion.

Materials and methods
Data preparation. We downloaded harmonized TCGA gene expression data processed by HTSeq-FPKM  Table S1.
The log 2 (1 + X) transformed expression data were normalized using the quantile normalization methods implemented in the limma package from Bioconductor. Genes having zero value across all samples from any type of cancer were excluded from the whole datasets. Gene-level CNA values were inferred from their corresponding CNS data. The CNS data are in the form of log-2-ratio, i.e. zero means a normal diploid number of 2, a positive number means amplification, and a negative number represents deletion. If a gene did not fall into any segment in the CNS data, then its CNA value was inferred by the mean value of its two adjacent segments. Each row of an expression/CNA matrix corresponds to a gene (or a signature), while each column corresponds to a cancer case.
Association measurement. The association measure of mutual information (MI) I(A; B) between two random variables A and B is defined by the expected value of −log p A p B /p AB , where p A and p B are the marginal distributions and p AB is the joint probability density. We use a spline-based estimator with six bins in each dimension to estimate the MI 90 given the two vectors representing the variables. We normalize this estimate by dividing by the maximum of the estimated I(A; A) and I (B; B) , so that the result has a maximum value of 1 representing complete corlation beeen two variables, and a minimum value of zero representing independence between two variables. We multly by − 1 whenever the Pearson correlation between A and B is negative, so the final association measure can take values between − 1 and 1.
If variables A and B both exist in all types of cancer, then the pan-cancer association between A and B is defined by the weighted median of the normalized MIs between A and B across all types of cancer, where the weights are given by the proportion of samples in each cancer type. Specifically, by using the weighted median, www.nature.com/scientificreports/ Genomically co-localized signature finding algorithm. We first sort all N genes (N = 56,830) based on genomic mid-point and apply a sliding-window preprocessing approach to identify the co-exprsion signatures, as follows. We use each of the N genes as a seed gene, applying the iterative attractor metagene iterative algorithm 5 , considering only the nearest S genes (S/2 at each side, or as many as available at chromosomal ends) of this gene according to the genomic sorting (setting window parameter S = 150 genes, exponent parameter α = 2 and convergence parameter ε = 10 -7 ). The resulting attractor metagene is defined by a weighted average of the expression values of these S + 1 genes. There are S + 1 such weights. The name of the gene with the highest weight is used as the name of this metagene, and the remaining S genes are sorted in terms of their corresponding weights. The strength of each attractor metagene is defined as the fifth highest weight. We filter out metagenes with strength less than 0.5. Therefore, each metagene contains at least five strongly co-expressed genes. The chromosomal range of each metagene is defined by its member genes with weight larger than 0.5. Attractor metagenes with overlapped chromosomal ranges are then merged into one cluster, resulting in a total of L clusters, each of which defines a chromosomal range. For each of these resulting L chromosomal ranges, we run the attractor metagene algorithm again, using each of the member genes as a seed within the range. If a chromosomal range yields multiple different attractor metagenes, we select the one with the highest strength to represent the chromosomal range. In the end, we generate L attractor metagenes. We further filter out any attractor metagenes whose top five genes have zero expression values in more than half of the samples. Finally, we filter out the gender-based attractor metagenes located on chromosome X and Y.
Association between the expression levels and the CNA levels of a signature. We use the average of expression/CNA levels of the top five genes of a genomically co-localized signature as a measure of the overall expression/CNA level of this signature. Then the pan-cancer association between the expression levels and CNA levels of a signature (pan-cancer expression-CNA association) is given by the weighted median of the corresponding normalized MIs, where the weights are given by the proportion of samples in each cancer type. We run 10,000 permutations and a random distribution between the permuted expression level and CNA level of each signature in each type of cancer is generated. For each signature, its pan-cancer distribution is obtained by the weighted median of its sorted distribution in each type of cancer. The P value of the pan-cancer expression/CNA association is given by the proportion of the pan-cancer distribution larger than the pan-cancer www.nature.com/scientificreports/ expression/CNA association and later adjusted using Bonferroni correction. We assume that P = 0.05 defines the threshold of statistical significance.
Signatures located on amplicons or deleted regions across multiple types of cancer. We set thresholds t amp and t del to identify genomically co-localized signatures located on amplified or deleted regions, to be selected so that genes with CNA values larger than t amp are amplified and genes with CNA levels smaller than t del are deleted. The thresholds t amp and t del are set using the empirical distribution of CNS levels in normal samples. vels of a normal sample are first subtracted by the mean CNS value of this sample. Then, for each cancer type c , we obtain t amp|c ( t del|c ) using the mean value of the top (bottom) 10 percentile CNA values from all the samples in this cancer type. The thresholds t amp ( t del ) are calculated by the weighted median of t amp|c ( t del|c ) across the eighteen types of cancer. This gives t amp = 0.45 and t del = −0.62. The amplification and deletion frequencies of each genomically co-localized signature are calculated in each of the eighteen types of cancer. A signature is classified as amplified (deleted) in one type of cancer if its amplification (deletion) frequency is larger than t freq , which is empirically set to 3%. We assume that if a signature is amplified (deleted) in more than 6 types of cancer, then this signature is located on a pan-cancer amplicon (deleted region) and assume two adjacent signatures are co-amplified or co-deleted if they have a CNA difference less than 0.1. www.nature.com/scientificreports/