Systematic identification of genes with a cancer-testis expression pattern in 19 cancer types

Cancer-testis (CT) genes represent the similarity between the processes of spermatogenesis and tumorigenesis. It is possible that their selective expression pattern can help identify driver genes in cancer. In this study, we integrate transcriptomics data from multiple databases and systematically identify 876 new CT genes in 19 cancer types. We explore their relationship with testis-specific regulatory elements. We propose that extremely highly expressed CT genes (EECTGs) are potential drivers activated through epigenetic mechanisms. We find mutually exclusive associations between EECTGs and somatic mutations in mutated genes, such as PIK3CA in breast cancer. We also provide evidence that promoter demethylation and close non-coding RNAs (namely, CT-ncRNAs) may be two mechanisms to reactivate EECTG gene expression. We show that the meiosis-related EECTG (MEIOB) and its nearby CT-ncRNA have a role in tumorigenesis in lung adenocarcinoma. Our findings provide methods for identifying epigenetic-driver genes of cancer, which could serve as targets of future cancer therapies.


Supplementary Tables
Supplementary Table 1 To estimate the expression of the non-coding RNAs (ncRNAs) in Illumina Human Body Map (HBM) and compare with the ncRNAs with testis-specific expression patterns (TS-ncRNAs) identified by the GTEx, we downloaded raw FASTQ files from E-MTAB-513 and performed a comprehensive RNA-seq analysis on 14 normal tissues from HBM. Initial sequence quality was evaluated using FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc).
Cutadapt (http://code.google.com/p/cutadapt/) was used to trim Illumina sequencing adaptors and poor-quality bases with a quality score of 20 and discard reads with a length below 30 bp after trimming. Reads were mapped to the reference genome (GENCODE Version 19, http://www.gencodegenes.org/releases/19.html) using TopHat2 1 (v2.0.9) with default parameters. Reference genome annotation files and the transcriptome reference gene set were downloaded from the GENCODE v19 databases. Cufflinks 2 (v2.2.1) was used to assemble transcripts and to estimate expression abundances with the parameter "-G."

Database combination and gene annotation
In this study, ENSEMBL ID from GENCODE v19 was regarded as the official indicator for further analysis. All databases annotated by other references were re-annotated by an R package biomaRt 3,4 . Any genes/proteins that failed to annotate unambiguously were excluded from the subsequent analysis.

Methods to evaluate testis-specific genes (TSGs)
In this study, the specificity measure 5 (SPM) was used to evaluate the testis-specific expression pattern.
Each gene expression profile is transformed into a vector : where n is the number of tissues in the profile. Similarly, a vector can be generated to represent the gene expression in testis: SPM is the cosine value of the intersection angle θ between vectors and in high dimensional feature space. This variable is calculated by the following expression: where | | and | | are the length of vectors and , respectively. SPM values range from 0 to 1, with values close to 1 indicating a major contribution to gene expression in a testis (vector ) relative to all other tissues (vector ). Testis-specific genes were defined as genes with SPM higher than 0.9, thus including both testis-restricted and testis-selective genes.

Protein expression quantification in human protein map (HPM)
Normalized spectral counts data were downloaded from http://www.humanproteomemap.org/download.php. Because the SPM distribution calculated from protein spectral counts was similar to the SPM distribution calculated from mRNA abundance (Supplementary Figure 9), we chose the same cutoff (0.9) to identify testis-specific proteins (TSPs).

Enrichment analysis of testis-specific regulatory elements (TSREs)
In this study, we performed enrichment analysis to evaluate the relationship between the TSREs and the TSGs. Four types of regulatory elements were included in the analysis (promoter, methylation level, ncRNA and enhancer).
Genes from C2 and C4 groups were considered as testis-specific non-coding RNAs (TS-ncRNAs) in our analysis. To avoid ambiguous mapping which derived from overlapping exons of protein-coding genes, we excluded ncRNA that overlapped with the exons of protein-coding genes in the same strand.
The activity of promoters and enhancers was estimated by Cap Analysis of Gene Expression (CAGE) from the Fantom project 6 . We downloaded CAGE expression levels

mRNA expression quantification in TCGA data
We obtained level 3-normalized TCGA RNA-seqV2 expression quantification data from Quality control processes followed the same protocols for handling RNA-seq data in normal tissues. Gene expression was quantified for the transcript models corresponding to the GENCODE v19 using RSEM and normalized within sample to a fixed upper quartile.

The definition of EE and activated EECTG/Ps
For each EECTG/EECTP, all samples were classified as activated samples or inactivated samples based on whether their expression exceeded the extremely high expression cutoff ( ) and were recoded as 1 and 0 respectively. For each sample, number of activated EECTPs (count of EECTPs which were coded as 1) was used to represent the degree driven by EECTPs. In our LUAD validation, because the expression of MEIOB of sample 130717001 approaches the extremely-high expression criteria and its co-factor SPATA22 is validated, we consider it as a validated EECTP and include it in the further functional assay.

Obtaining and processing somatic mutation data sets
As described in the result sections, we obtained somatic mutation information to explore the relationship between the EE pattern of EECTPs and somatic mutations. Mutation data were downloaded from the Synapse platform (syn1729383) as "maf" files within the context of the PANCANCER project. Only cancer types with more than 100 samples with both expression and mutation data were included in the analysis, and EECTPs were redefined using data of platform overlapped samples. Impact scores given by the IntOGen-mutations Web discovery tool (http://www.intogen.org/search) were used to evaluate the potential functions of mutations.
Significantly mutated genes (SMGs) of each cancer were obtained from Supplementary Table   4 of a previously published paper 9 . The mutation ratio represented the degree of samples driven by SMG mutations and was calculated as the ratio of the mutation number in SMGs and the mutation number in all genes. Driver summary of papillary thyroid carcinoma were obtained from the Supplementary Table 2 of previous study 10 .
Linear regression was used to evaluate the association between the mutation ratio and activated number of EECTPs. For each SMG, a Wilcoxon's rank sum test was used for the statistical comparison of the activated EECTP number between mutated and non-mutated samples. Fisher's exact test was employed to test mutually exclusive patterns between the SMGs' mutations and EECTPs' EE patterns. P-values were adjusted by Benjamini-Hochberg false discovery rate (FDR-BH). PAM50 subtypes were obtained from the related data from a previous paper 11 .

Obtaining and processing methylation data sets
Seven cancer types had more than 100 samples with both expression and methylation data were included in the analysis, and EECTPs were redefined using data of platform overlapped samples. We downloaded Illumina raw idat-files produced by the Infinium

The definition of CT-ncRNA and data processing
We downloaded the expression quantification of differential expressed ncNRAs from lncrnator 13  Spearman's rank correlation test was used to estimate the correlation coefficient of the expression of CT-ncRNAs and nearby protein-coding CT genes. The cancer types were included in the correlation analysis which had more than ten samples with both expression of CT coding genes and CT-ncRNAs. P-values were adjusted by Benjamini-Hochberg false discovery rate (FDR-BH).

Obtain and processing copy number data
We obtained level 3-focal copy number data from Firehose at the MIT Broad Institute For allele specific copy number analysis, raw .CEL files from genome-wide SNP6.0 microarray data of LUAD samples were preprocessed by R package affy2sv 15 and allele-specific copy number profiling was performed with ASCAT v2.1 16 .
Scores of chromosomal instability scarring (SCINS) were calculated using the following steps of previous study 17 : 1) The proportion of the genome consisting of AiCNA segments, save those segments that encompass a whole chromosome, is calculated.
2) The number of AiCNA segments greater than or equal to 8Mb in length but less than the length of a whole chromosome is counted.
3) The measure of AiCNA segments (S AiCNA ) is calculated by multiplying the proportion obtained in step 1) by the number of segments counted in step 2).
4) The proportion of the genome consisting of CnLOH segments is calculated.
5) The number of CnLOH segments greater than or equal to 4Mb in length, including those that span a whole chromosome, is counted.
6) The measure of CnLOH segments (S CnLOH ) is calculated by multiplying the proportion obtained in step 4) by the number of segments counted in step 5).
7) The measure of AbCNA segments (S AbCNA ) is calculated by counting the number of AbCNA segments greater than or equal to 8Mb in length.
8) The measure of all allelic imbalanced segments (S Ai ) is calculated by summing S AiCNA and S CnLOH .