Integrated genomic and molecular characterization of cervical cancer

Cervical cancer remains one of the leading causes of cancer-related deaths worldwide. Reported here is an extensive molecular characterization of 228 primary cervical cancers, the largest comprehensive genomic study of cervical cancer to date. We observed striking APOBEC mutagenesis patterns and identified SHKBP1, ERBB3, CASP8, HLA-A, and TGFBR2 as novel significantly mutated genes in cervical cancer. We also discovered novel amplifications in immune targets CD274/PD-L1 and PDCD1LG2/PD-L2, and the BCAR4 lncRNA that has been associated with response to lapatinib. HPV integration was observed in all HPV18-related cases and 76% of HPV16-related cases, and was associated with structural aberrations and increased target gene expression. We identified a unique set of endometrial-like cervical cancers, comprised predominantly of HPV-negative tumors with high frequencies of KRAS, ARID1A, and PTEN mutations. Integrative clustering of 178 samples identified Keratin-low Squamous, Keratin-high Squamous, and Adenocarcinoma-rich subgroups. These molecular analyses reveal new potential therapeutic targets for cervical cancers.

observation resembles findings in bladder cancer 9 and HPV-positive head and neck squamous cell cancers (HNSC) 10 , but it differs from observations in breast and most other cancers 11 . The underlying nucleotide substitution pattern in the E542K and E545K mutations is associated with mutagenesis by a subclass of APOBEC cytidine deaminases 8,[12][13][14][15] , with 150 of 192 exomes displaying statistically significant (q<0.05) enrichment (up to 6-fold) for the APOBEC signature. Further, the APOBEC mutation load correlated strongly with the total number of mutations per sample (Extended Data Fig. 2h), suggesting that APOBEC mutagenesis is the predominant source of mutations in cervical cancers.
Structural rearrangements were identified through analysis of RNA-seq (Core Set, n=178) and whole genome sequencing (WGS) data with low-pass (n=50) and deep (n=19) coverage. Both RNA-seq and WGS detected 22 putative structural rearrangements in 14 patients (Supplemental Table 8). In total, 26 recurrent fusions were found (Supplemental Table 9, with examples in Extended Data Fig. 4d). RNA-seq analysis revealed 4 cases harboring 16p13 ZC3H7A-BCAR4 gene fusions, with exon 1 of ZC3H7A linked to the last exon of BCAR4. Whole genome sequencing revealed tandem duplication and copy number gain of BCAR4 on chromosome 16p13.13 (Extended Data Fig. 4c). BCAR4 is a metastasispromoting lncRNA that enhances cell proliferation in estrogen-resistant breast cancer by activating the HER2/3 pathway. Lapatinib, an EGFR/HER2 inhibitor, counteracts BCAR4driven tumor growth in vitro, and warrants evaluation as a possible therapeutic agent in BCAR4-positive cervical cancer 19 .

Integrated analysis of molecular subgroups and pathways
Integration of copy number, methylation, mRNA, and miRNA data using iCluster 20 highlighted the molecular heterogeneity of cervical carcinomas. Three clusters were identified that largely corresponded to mRNA clusters (Supplemental Fig. S9): a squamous cluster with high expression of keratin gene family members (Keratin-high), another squamous cluster with lower expression of keratin genes (Keratin-low), and an Adenocarcinoma-rich cluster (Adenocarcinoma). Keratin-high and Keratin-low clusters included 133 of 144 squamous cell carcinomas and the Adenocarcinoma cluster contained 29 of 31 adenocarcinomas (Fig. 2). KRAS (p=9.7e-5), ERBB3 (p=2.6e-3), and HLA-A (p=0.03) mutations were significantly associated with iClusters, with KRAS mutations absent from the Keratin-high cluster and HLA-A mutations missing in the Adenocarcinoma cluster (Fig. 2). Members of the SPRR and TMPRSS cornification gene families and the SMGs ARID1A (p=0.02), NFE2L2 (p=6.9e-6), and PIK3CA (p=0.01) were differentially expressed between Keratin-low and Keratin-high clusters (Extended Data Fig. 4b).
Unsupervised hierarchical clustering of variable DNA methylation probes produced three groups (Extended Data Fig. 5a), including a small "CpG island hypermethylated" (CIMPhigh) cluster, a CIMP-intermediate cluster, and a CIMP-low cluster that were associated with an epithelial-mesenchymal transition (EMT) mRNA score (Extended Data Fig.  5b) 10,21 . Most of the cases in the Adenocarcinoma iCluster were CIMP-high, while the other iClusters contained a mixture of CIMP-intermediate and CIMP-low samples (Fig. 2). Comparing all cervical carcinomas to 120 normal samples drawn from 12 TCGA projects, we identified 1026 epigenetically silenced genes that were methylated to a greater extent in cancers than in normal tissues, including several zinc-finger (ZNF), protease (ADAM, ADAMTS), and collagen (COL) genes (Supplemental Tables 11 and 12).
Unsupervised clustering resulted in 6 miRNA clusters associated with iClusters (p=1.7e-19) (Extended Data Fig. 6a). Samples from the Adenocarcinoma iCluster almost exclusively overlapped with miRNA cluster 5, and were characterized by high expression of miR-375 and low expression of miR-205-5p and miR-944 (Supplemental Table 31). Expression levels of tumor suppressors miR-99a-5p and miR-203a were significantly higher in Keratin-high cluster samples than Keratin-low cluster samples (Supplemental Table 31; p=0.01 and 0.008, respectively). Among miRNAs with significant and functionally validated gene and protein anti-correlations 22 , one large subnetwork involved miR-200-family and other miRs with expression patterns anti-correlated with those of the EMT-related transcription factors ZEB1, ZEB2, and SNAI2, the Hippo and p73 transcriptional co-factor YAP1, the receptor tyrosine kinases (RTKs) ERBB2, ERBB3, and AXL, and the hormone receptor ESR1 (Extended Data Fig. 6b, Supplemental Fig. S17, Supplemental Fig. S18, and Supplemental Table 15).
Reverse Phase Protein Array (RPPA) analysis of 155 samples with 192 antibodies (Extended Data Fig. 1a and Supplemental Table 17) identified three clusters significantly associated with iClusters (p=1.8e-4) and EMT mRNA score (Fig. 3a, c, d and Supplemental Table 16). EMT cluster samples were enriched in the Keratin-low iCluster, while PI3K/AKT and Hormone cluster samples were enriched in the Keratin-high and Adenocarcinoma iClusters, respectively, suggesting distinct pathway activation across integrated cervical cancer subtypes. Differential expression levels of Phospho-MAPK, Phospho-EGFR (Y1068), Phospho-Src (Y416), IGFBP2, and TIGAR between Keratin-high and Keratin-low iClusters suggest diverse activation patterns of RTK, MAPK, PI3K, and metabolic signaling pathways that may underlie the molecular diversity of cervical squamous cancers (Fig. 2).
The core members of each RPPA cluster with the highest silhouette width (>0.02, n=115) were associated with five-year survival ( Fig. 3b; p=6.1e-4), with the EMT group exhibiting worse outcome. Interestingly, this was the only platform where clusters associated with outcomes (Supplemental Figs. S8, S9, S12, and S22; Supplemental Information S6). Samples in the EMT cluster exhibited high "reactive" pathway scores (Supplemental Fig.  S20) 11 , illustrating for the first time in cervical cancer the presence of a subset of stromal "reactive" tumors with high expression of Caveolin-1, MYH11, and Rab11, which also appear in other diseases (Supplemental Table 16) 23 . YAP was the most significantly differentially expressed protein distinguishing EMT cluster samples from all others (Supplemental Table 18; p=1.7e-15) and YAP1 was significantly amplified in the EMT cluster samples compared with the Hormone (p=1.1e-5) and PI3K/AKT cluster (p=6.4e-4) samples. Regulation of the EMT-related molecules YAP and ZEB1 24-26 may also be driven by significantly lower expression levels of miR-200a-3p in the EMT cluster samples compared with other RPPA cluster samples (Extended Data Fig. 6b and Extended Data Fig.  7a; p=3.8e-3). These results highlight potential roles for YAP and reactive stroma in the context of EMT-regulated cervical cancer progression.
The Mutual Exclusivity Modules in cancer (MEMo) algorithm 27 uses somatic mutation and copy number data to identify oncogenic networks with mutually exclusive genomic alterations. Since miR-200a and miR-200b (miR-200a/b) expression were negatively correlated with EMT mRNA scores (Extended Data Fig. 7b, d), we used MEMo to examine alterations in miR-200a/b and EMT networks and found a potential link between TGFβ pathway and miR-200a/b alterations in regulating EMT 28,29 . Deletions and mutations affecting the receptor gene TGFBR2, the modulating genes CREBBP and EP300, and the transcription factor SMAD4 all likely impinge on growth suppressive and pro-apoptotic functions driven by TGFβ (Fig. 4c) and were observed in 30% of squamous cell carcinomas (Fig. 4d). Tumors with both hypermethylation and downregulation of miR-200a/b (referred to as altered) were restricted to squamous cell carcinomas, were enriched in the Keratin-low iCluster ( Fig. 4d and Extended Data Fig. 8; p=0.001 for both miR-200a and miR-200b), showed significant upregulation of both ZEB1 and ZEB2 (Extended Data Fig. 9a-d), and were mutually exclusive with TGFβ signaling pathway alterations (Fig. 4d). Importantly, samples with altered miR-200a/b exhibited higher EMT mRNA scores than unaltered samples, while there was no significant difference between samples with or without TGFβ pathway alterations ( Fig. 4d and Extended Data Fig. 7c, e). These findings highlight potential treatment approaches for this subgroup of cervical cancer patients, as targeting EMT may render tumors more sensitive to small molecule inhibitors and cytotoxic chemotherapy 21,30,31 .
MEMo analysis also showed differences in therapeutically-relevant RTK, PI3K, and MAPK pathway alterations across cervical cancers. MEMo identified mutual exclusivity modules involving alterations within both the PI3K and MAPK pathways (Supplemental Table 27; adjusted p=0.06); however, there was a strong tendency for co-occurrence of ERBB2 and ERBB3 alterations within adenocarcinomas (p<0.001, log odds-ratio > 3), indicating that a subset of these tumors may exhibit aberrant HER3 signaling through interactions between mutant HER3 and activated HER2 and therefore could potentially benefit from HER2-and HER3-targeted therapies (Fig. 4a, b) 32 . Although not statistically significant, aberrations in PIK3CA also tended to co-occur with PTEN somatic mutations and deletions (p=0.078, logodds ratio=0.71), which is similar to copy number-low endometrial tumors and suggests potential therapeutic benefit from PI3K pathway targeting agents 17 .

Cross-cancer analysis
To evaluate the relationship of cervical cancer subtypes with endometrial cancer, an adjacent cancer site with hormone-related carcinogenesis, and HNSC, a subset of which is caused by HPV, hierarchical clustering of cervical, uterine corpus endometrial (UCEC) 17 , and HNSC 10 mRNA expression data was performed. Three major groups were observed, with Cluster 1 including all UCEC samples and most cervical adenocarcinomas and characterized by overexpression of hormone receptor genes ESR1 and PGR (Extended Data Fig. 4a). Cluster 2 included predominantly squamous cervical carcinomas and 23/27 HPV-positive HNSC samples. Cluster 3 included few cervical cancers and the remaining HNSC cancers, which were mostly HPV-negative. This highlights the similarity of HPV-related squamous cancers at different anatomical sites.
Since a subset of cervical cancers clustered with endometrial samples, a gene expression classifier was developed to predict whether carcinomas were cervical or endometrial (Supplemental Information S5). We classified 8 of 178 (4.5%) cervical cancer samples as endometrial-like (UCEC-like) cancers, which were confirmed to be cervical cancers by study pathologists (Extended Data Fig. 1f, g). These tumors included 7 of 9 HPV-negative cancers and 5 of the 8 were adenocarcinomas. Six UCEC-like cancers were in the Adenocarcinoma iCluster and 2 were in the Keratin-low iCluster. Despite their low number, the UCEC-like tumors accounted for 33%, 27%, and 20% of mutations in ARID1A, KRAS, and PTEN, respectively. They were associated with the RPPA Hormone and miRNA C6 clusters, and all but one sample was CIMP-low and CN low (Supplemental Table 1). Page 6 Nature. Author manuscript; available in PMC 2017 July 23.
HPV A7 types were enriched in Keratin-low and Adenocarcinoma iClusters (p=5e-4). Most HPV clade A7 tumors were CIMP-low, and HPV-negative tumors formed a distinct subgroup within the CIMP-low cluster with a significantly lower mean promoter methylation level than other samples in that cluster (Extended Data Fig. 5a; p=5e-3). Samples with the highest rate of silencing were HPV-positive adenocarcinomas, particularly those related to A9 types (t-test p-values <0.001). Functional Epigenetic Module (FEM; Supplemental Information S13) analysis 39 , which integrates DNA methylation and gene expression data using protein-protein-interaction networks, identified inverse correlations between methylation and gene expression in HPV-positive vs. HPV-negative cervical cancers and HPV-positive (n=36) vs. HPV-negative (n=243) HNSCs. The analysis revealed 12 statistically significant subnetworks for cervical cancer and 11 for HNSCs, with one common subnetwork centered around Forkhead Box A2 (FOXA2) (Supplemental Table 19 and Supplemental Fig. S32). miR-944, miR-767-5p, and miR-105-5p were the most differentially expressed miRNAs between HPV-positive and HPV-negative samples (Supplemental Fig. S14e). miR-944 expression was also significantly higher while miR-375 expression was significantly lower in HPV16-positive squamous cancers compared with HPV18-positive squamous cancers (Supplemental Fig. S14d). Interestingly, HPV-negative cancers displayed a significantly higher EMT mRNA score and a lower frequency of the APOBEC mutagenesis signature compared with HPV-positive tumors (Extended Data PARADIGM was used to evaluate molecular pathways differentially activated in squamous samples with A7 and A9 HPV infections. We observed higher inferred activation of p53 and p63 signaling and lower FOXA1 signaling in tumors infected with A9 types ( Fig. 5a and Supplemental Fig. S23a). Higher SFN pathway activation was also observed for A9-positive tumors, which is consistent with the low methylation and high gene expression patterns of SFN revealed by FEM analysis (Fig. 5a and Supplemental Table 19). Interestingly, the SFNencoded Stratifin/14-3-3σ adapter protein has previously been associated with epithelial immortalization and squamous cell cancers 40,41 , altered p53 pathway activation 42 , and Wntmediated β-catenin signaling 43 .
Viral-cellular fusion transcripts indicating integration of HPV into the host genome were observed in 141 of 169 (83%) HPV-positive cancers, including all HPV18-positive cancers. Of these 141 cases, 90 (64%) had a single HPV integration event, 35 had two events, and 16 had three or more events (totaling 220 unique integration events) (Supplemental Table 3). HPV integration events affected all chromosomes, including some previously described hotspots such as 3q28 and 8q24 (Fig. 5b) 44 . Genomic loci affected by integration were characterized by increased SCNAs (p=6.9e-13 for HPV16 and p=0.058 for HPV18) and increased gene expression (p=1.6e-11 for HPV16 and p=0.011 for HPV18) (Extended Data Fig. 11c, d). One hundred fifty-three (70%) fusion transcripts included known or predicted genes, while the remainder included intergenic regions ( Fig. 5b and Supplemental Table 3).

Conclusion
Through comprehensive molecular and integrative profiling, we identified novel genomic and proteomic characteristics that subclassify cervical cancers. Integrated clustering identified Keratin-low squamous, Keratin-high squamous, and Adenocarcinoma-rich clusters defined by different HPV and molecular features (Extended Data Fig. 8). ERBB3, CASP8, HLA-A, SHKBP1, and TGFBR2 were identified as SMGs for the first time in cervical cancer, with ERBB3 (HER3) immediately applicable as a therapeutic target. Notably, we report amplifications and fusion events involving the BCAR4 gene for the first time in cancer, which can be targeted indirectly by lapatinib. Further, we identified amplifications in CD274 and PDCD1LG2, two genes that encode for well-known immunotherapy targets. A set of endometrial-like cervical cancers comprised predominantly of HPV-negative tumors and characterized by mutations in KRAS, ARID1A, and PTEN was discovered, with PTEN and potentially ARID1A proteins serving as therapeutic targets. Importantly, over 70% of cervical cancers exhibited genomic alterations in either one or both of the PI3K/MAPK and TGFβ signaling pathways (Extended Data Fig. 9e), illustrating the potential clinical significance of therapeutic agents targeting members of these pathways. For the first time, we report distinct molecular pathways activated in cervical carcinomas caused by different HPV types, highlighting the biologic diversity of HPV.
Together, these findings provide insight into the molecular subtypes of cervical cancers and rationales for developing clinical trials to treat populations of cervical cancer patients with distinct therapies.  Table 2). Of these cases, 14 had mutations called and 60 had RPPA data available; however, RPPA data for 17 cases was excluded due to low Page 8 Nature. Author manuscript; available in PMC 2017 July 23. protein content within samples (Supplemental Table 2). Mutations were called for 192 samples (Extended Set), while all other platform and integrated analyses (aside from protein) were performed on the subset of 178 Core Set samples. Protein levels were measured on 155 samples, which included 119 total samples from both the Core and Extended Sets as well as 36 samples outside of these sets. The total number of nonoverlapping samples across Core, Extended, and RPPA datasets is 228 (Extended Data Fig.  1a).

HPV detection, variant calling, and transcript analysis
HPV status was determined using consensus results from MassArray and RNA-seq (Supplemental Information S2). MassArray uses real-time competitive polymerase chain reaction and matrix-assisted laser desorption/ionization-time of flight mass spectroscopy with separation of products on a matrix-loaded silicon chip array, similar to the work described in Tang et al 45 . Two approaches for pathogen detection from RNA-seq data were used. The first used the microbial detection pipeline at the British Columbia Cancer Agency's Genome Sciences Centre (BC), which is based on BioBloom Tools (BBT, v1.2.4b1) 46 . The second used the PathSeq algorithm 47 at the Broad Institute (BI) to perform computational subtraction of human reads followed by alignment of residual reads to a combined database of human reference genomes and microbial reference genomes including HPV. In 97% of samples, complete agreement between MassArray and both RNA-seq approaches was observed. The remaining discrepant samples were resolved by majority decision, assigning the genotype called by at least two of the methods. RNA-seq data in FASTA format was used to identify HPV variants (Supplemental Fig. S1). Unaligned reads were taken from the PathSeq analysis and aligned to HPV reference genomes using TopHat 48 with default parameters 49 . The HPV variant lineages/sublineages were assigned based on the phylogenetic topology and confirmed visually using the SNP patterns 50 . HPV splice junctions from RNA-seq were determined using TopHat. Two transcript types were distinguished for HPV16 and HPV18: (a) transcripts that included evidence of an unspliced sequence of E6, and (b) a transcript spliced at the E6 splice donor site (position 226 for HPV16 and position 233 for HPV18) (Supplemental Fig. S2). Read counts for unspliced, spliced, as well as the ratio of unspliced/spliced transcripts were categorized into quartiles separately for HPV16 and HPV18.

HPV integration analysis
Using RNA-seq data, concordance of integration events based on alignments of contigs from de novo transcriptome assembly (BC) and read alignments (BI) was evaluated (Supplemental Fig. S3). We identified method-specific integration events by assigning all sites within a 500-kb sliding window to a single integration event located at the median coordinate of that event's assigned sites. An integration event was labeled as 'confident' when the total read support for each of its supporting integration sites passed center-specific read evidence thresholds. To take advantage of differences between the two integration methods (i.e. contig and read), for the concordance analysis we used all method-specific integration events (both confident and non-confident events). We labeled an integration event as 'concordant' when both methods reported an integration event within 500 kb in the same patient. For some concordant events, both methods reported a confident event. An integration event was labeled as 'discordant' when only one center reported a confident integration event within 500 kb (Supplemental Figs. S4 and S5). For both intragenic and intergenic concordant events, we reported a range of coordinates that extends from the most proximal to the most distal supporting integration site. We assessed gene-level expression relative to somatic copy number and structural variant data for genes into which we had mapped viral-human junctions from RNA sequencing data and for genes that were associated with enhancers into which we had mapped RNA junctions.

DNA sequencing and mutation calling
Detailed methods for library hybrid capture, read alignments, and somatic variant calling are documented in Supplemental Information S3. MutSig2CV 6 was utilized to identify significantly mutated genes (SMGs) within the cervical cancer exome sequencing data. Mutations were analyzed for the Core Set plus 14 samples to total 192 Extended Set samples. Eleven samples were identified to exhibit greater than average mutations rates and were termed "hypermutants" (somatic mutations >600). These 11 samples were excluded from the analysis for identifying SMGs. All 3 sample subsets (all samples, squamous carcinomas only, adenocarcinomas only) without "hypermutants" (Supplemental Table 4) were analyzed using an FDR cutoff of 0.1. FDR values are shown in Supplemental Table 4. SMG analysis using the entire sample cohort in Ojesina et al. was performed as described previously 8 .

Copy number analysis
DNA from each tumor or germline sample was hybridized to Affymetrix SNP 6.0 arrays using protocols at the Genome Analysis Platform of the Broad Institute as previously described 51 . Briefly, Birdseed was used to infer a preliminary copy number at each probe locus from raw. CEL files 52 . For each tumor, genome-wide copy number estimates were refined using tangent normalization, in which tumor signal intensities are divided by signal intensities from the linear combination of all normal samples that are most similar to the tumor 16 . Individual copy number estimates then underwent segmentation using Circular Binary Segmentation 53 , and segmented copy number profiles for tumor and matched control DNAs were analyzed using Ziggurat Deconstruction 54 . Significance of copy number alterations were assessed from the segmented data using GISTIC2.0 (Version 2.0.22) 54 . For the purpose of this analysis, an arm-level event was defined as any event spanning more than 50% of a chromosome arm. For copy number based clustering, tumors were clustered based on log2 copy number at regions revealed by GISTIC analysis. Clustering was done in R based on Euclidean distance using Ward's method. Allelic and integer copy number, tumor purity, and tumor ploidy were calculated using the ABSOLUTE algorithm 55 .

Detecting structural variants from RNA-seq and WGS data
Integrative analysis was performed to identify putative driver fusions using both WGS (lowpass and high-coverage) and RNA-seq data. RNA-seq data for 178 Core Set cases were analyzed using the TopHat-Fusion and BreakFusion, PRADA, and MapSplice algorithms. To identify structural variations in WGS data, 50 low-pass WGS and 19 high-pass WGS samples were analyzed. Detection of structural variations in low-pass WGS data was performed using two algorithms, BreakDancer 56 and Meerkat 57 , with a requirement for at least two discordant read pairs supporting each event and at least one read covering the breakpoint junction. High-pass WGS data were analyzed to detect somatic structural variations using two runs of BreakDancer and one run of SquareDancer (https://github.com/ ding-lab/squaredancer). The gene fusion lists generated by all methods and platforms were integrated (See Supplemental Tables 8-10).

APOBEC mutagenesis analysis
Analysis is based on previous findings that APOBECs deaminate cytidines predominantly in a tCw motif and that the APOBEC mutagenesis signature is composed of approximately equal numbers of two kinds of changes in this motif: tCw→tTw and tCw→tGw mutations (flanking nucleotides are shown in small letters; w=A or T). Using mutation data from all 192 Extended Set samples, we calculated on a per sample basis the enrichment of the APOBEC mutation signature among all mutated cytosines in comparison to the fraction of cytosines that occur in the tCw motif among the +/-20 nucleotides surrounding each mutated cytosine ("APOBEC_enrich" column in data files). The minimum estimate of the number of APOBEC-induced mutations in a sample (APOBEC_MutLoad_MinEstimate) was calculated using the formula: ["tCw→G+tCw→T"]x[("APOBEC_enrich"-1)/ "APOBEC_enrich"], which allows estimating the number of APOBEC signature mutations in excess of what would be expected by random mutagenesis. "APOBEC_MutLoad_MinEstimate" was calculated only for samples passing 0.05 FDR threshold for APOBEC enrichment (["BH_Fisher_p-value_tCw"]<0.05. Samples with "BH_Fisher_p-value_tCw" value greater than 0.05 received a value of 0. The "APOBEC_MutLoad_MinEstimate" value shows high correlation (0.9-0.95) with all other parameters used to characterize the APOBEC mutagenesis pattern, such as APOBEC enrichment as well as absolute and relative APOBEC mutation loads. For some analyses and figures, the "APOBEC_MutLoad_MinEstimate" parameter was converted into categorical values as follows:

3.
"high": "APOBEC_MutLoad_MinEstimate">median of non-zero values The median of non-zero values in the Extended Set = 33.

Methylation analysis
The Illumina Infinium HM450 array 58 was used to evaluate DNA methylation in the Core Set of cervical cancer samples. Unsupervised consensus clustering was performed with Euclidean distance and partitioning around medoids (PAM) using the most variable 1% of CpG island promoter probes. Epigenetically silenced genes were identified as previously described 59 . A total of 120 normal samples were used for this analysis by selecting 10 samples at random from the 12 TCGA projects that included normal samples.

RNA-seq analysis
RNA was extracted, converted into mRNA libraries, and paired-end sequenced (paired 50 nt reads) on Illumina HiSeq 2000 Genome Analyzers as previously described 5 . RNA reads Page 11 Nature. Author manuscript; available in PMC 2017 July 23.
were aligned to the hg19 genome assembly using Mapsplice v12_07 60 . Gene expression was quantified for the transcript models corresponding to the TCGA GAF2.

EMT mRNA score analysis
The EMT score was computed as previously described 10,21 . Briefly, the EMT score was the value resulting from the difference between the average expression of mesenchymal (M) genes minus the average expression of epithelial (E) genes. All NA values were removed from the calculation. Two-sample t-test and ANOVA were applied to each comparison accordingly.

miRNA sequencing and analysis
MicroRNA sequence (miRNA-seq) data was generated for the Core Set of tumor samples using methods described previously 11 . We identified miRNAs that have been associated with EMT 62-66 and then calculated Spearman correlations between the EMT scores and RPMs for 5p and 3p mature strands for each of these miRNAs using MatrixEQTL and filtering by FDR<0.05. An miRNA was considered to be epigenetically controlled if BH-corrected pvalues were less than 0.01 for both a) a Spearman correlation of miRNA abundance (RPM) to beta for probes in promoter regions associated with the miRNAs, and for b) a t-test of RPM between unmethylated (β<0.1) and methylated (β>0.3) samples (an "epigeneticallycontrolled pattern"). We assessed potential miRNA targeting for all 178 samples and then separately for the 144 squamous samples by calculating miR-mRNA and miR-protein (RPPA) Spearman correlations with MatrixEQTL v2.1.1 using gene-level normalized abundance RNA-seq (RSEM) data and normalized RPPA data. Correlations were calculated with a p-value threshold of 0.05, and then the anti-correlations were filtered at FDR<0.05. We extracted miR-gene pairs that corresponded to functional validation publications reported by miRTarBase v4.5 22 . For miR-RPPA anti-correlations, all gene names that were associated with each antibody were used. Results were displayed with Cytoscape v2.8.3.
which were then log2-transformed and median-centered across samples for each gene. The log2-transformed, median-centered mRNA data were rank-transformed based on the global ranking across all samples and all genes and discretized (+1 for values with ranks in the highest tertile, -1 for values with ranks in the lowest tertile, and 0 otherwise) prior to PARADIGM analysis.
Pathways were obtained in BioPax Level 3 format, and included the NCIPID and BioCarta databases from http://pid.nci.nih.gov and the Reactome database from http://reactome.org. Gene identifiers were unified by UniProt ID and then converted to Human Genome Nomenclature Committee's HUGO symbols using mappings provided by HGNC (http:// www.genenames.org/). Altogether, 1524 pathways were obtained. Interactions from all of these sources were then combined into a merged Superimposed Pathway (SuperPathway). Genes, complexes, and abstract processes (e.g. "cell cycle" and "apoptosis") were retained and henceforth referred to collectively as pathway features. The resulting pathway structure contained a total of 19504 features, representing 7369 protein-coding genes, 9354 complexes, 2092 families, 82 RNAs, 15 miRNAs, and 592 abstract processes.
The PARADIGM algorithm infers an IPL for each pathway element that reflects the log likelihood contrasting the probability of activity against inactivity. An initial minimum variation filter (at least 1 sample with absolute activity > 0.05) was applied, resulting in 15502 concepts (5898 protein-coding genes, 7307 complexes, 1916 families, 12 RNAs, 15 miRNAs, and 354 abstract processes) with relative activities showing distinguishable variation across tumors.

iCluster analysis
Integrative clustering of RNA-seq, methylation, copy number, and miRNA data was performed using R package "iCluster 20 ." The Core Set of samples was used since all samples in this Set had data available across these four platforms. RNA-seq, methylation, copy number, and mature-strand miRNA datasets had 20531, 395552, 23109, and 1213 features, respectively. The 500 most variable features based on the standard deviation from each dataset were selected for the integrative clustering analyses. For analysis involving the RNA-seq and miRNA datasets, a log(x+1) transformation was used in order to deal with skewness in the data 67 . Methylation data was logit transformed to make it closer to normal distribution. The copy number variation data included the regions determined from GISTIC2.0, with copy number variation treated as a continuous measurement based on the segmentation mean value for the region.

MEMo analysis
High DNA methylation levels upstream of miR-200a and miR-200b corresponded to transcriptional downregulation of the miRs (Extended Data Fig. 9a). For a sample to be called altered for either miR-200a or miR-200b (or both), we required both high DNA methylation level upstream of the miR (β-value>0.3) and low miR expression (log2(RPM) < 9.3 for miR-200a and log2(RPM) < 9 for miR-200b). Binary calls were given to altered and unaltered samples based on this double threshold (1 = altered, 0 = unaltered).
The Mutual Exclusivity Modules in cancer (MEMo) algorithm 27 was run on all Core Set samples. MEMo was initially run on 27 regions of recurrent copy number gain, 36 of copy number loss, and 22 recurrently mutated genes. In order to include alterations for miR-200a and miR-200b in the MEMo analysis, a custom network was designed where each miR was connected to its known and validated targets (see above). Second, this network was merged with the comprehensive pathway network used by MEMo to search for modules of altered genes that include at least one of the miRs. Extracted modules were tested for mutual exclusivity using MEMo's statistical framework (Supplemental Table 27). Student's t-test was performed for comparing EMT mRNA scores between groups.
has traditionally been termed "non-keratinizing" because of the absence of characteristic keratin pearls. c, Squamous cell carcinoma of the large cell keratinizing type. Nests of atypical squamous cells infiltrate through a fibrotic stroma. In addition, this tumor shows highly eosinophilic keratin pearls with small, inky dark nuclei that imperfectly mimic the normal keratinization that is found in the epidermis. This differentiation pattern is aberrant in the cervix in which the squamous epithelium is normally a non-keratinizing squamous mucosa. d, Adenocarcinoma of endocervical type (well-differentiated). Closely set, atypical glands with enlarged nuclei and scattered mitotic figures infiltrate through the connective tissue of the cervix. The tall columnar tumor cells show basally-placed, crowded, enlarged nuclei that show frequent mitotic figures. Compared with normal endocervical cells, the tumor cells show relative loss of intra-cytoplasmic mucin and are frequently called "mucindepleted," although most, but not all endocervical adenocarcinomas show varying amounts of intracytoplasmic mucin at least focally. e, Adenosquamous carcinoma of cervix. This tumor shows both nests of non-keratinizing squamous cell carcinoma and glands composed of tall columnar adenocarcinoma reflecting the origin of most cervical cancers in the transformation zone of the cervix in which both squamous and glandular cells normally differentiate. Despite this biphasic differentiation potential, adenosquamous carcinomas are relatively uncommon in the cervix. f, UCEC-like HPV negative adenocarcinoma of endocervical type from a radical hysterectomy specimen. The endometrium in the uterus was benign. g, UCEC-like HPV positive adenocarcinoma of endocervical type from a radical hysterectomy specimen. The endometrium in the uterus was benign. All samples were stained with hematoxylin and eosin (20×).
portion of a gene and each highlighted section represents the UniProt functional domain.
Vertical lines indicate the boundaries of multiple annotation sources within common domain annotations as outlined in Supplemental Table 5. Horizontal lines distinguish overlapping domains. Circles represent a single mutation and are colored based on mutation type. Mutations present in squamous cell carcinomas are outlined in black while those present in adenocarcinomas are outlined in pink. g, PIK3CA mutations and recurrence are shown in a stacked circle plot, as above. Additionally, lolliplot sticks are colored red if the mutation type coincides with patterns of APOBEC mutagenesis. h, The minimal estimated number of APOBEC-induced mutations ("APOBEC_MutLoad_MinEstimate" column in Supplemental Table 1) strongly correlates with total number of mutations in a sample, as well as with the number of single nucleotide variants (SNVs) in G:C pairs which are the exclusive substrate for mutagenesis by APOBEC cytidine deaminases. While correlation with mutagenesis in A:T base pairs, which cannot be mutated by APOBEC enzymes is statistically significant (two-tailed P=0.047), it is very weak. Pearson correlation and R 2 were calculated for all 192 exome-sequenced samples, including samples with zero values. Only samples with non-zero values of "APOBEC_MutLoad_MinEstimate" are presented.
number clusters. Peaks are annotated with cytoband and candidate driver genes. The total number of genes in the peak region is indicated in parenthesis. Peaks with more than 30 genes in the peak region are excluded. Any genes annotated have a significant positive correlation with mRNA expressions. c, Chromosomal locations for peaks of significantly recurrent focal amplifications (red, right side) and deletions (blue, left side) are plotted by −LOG10 q-value for all Core Set samples. Peaks are annotated with cytoband and candidate driver genes. The total number of genes in the peak region is indicated in parentheses. Peaks consisting of more than 30 genes in the peak region are excluded. Annotated genes have a significant positive correlation with mRNA expression. d, Cytolytic activity (CYT) associations with PDL-1/2 amplification. Each bar represents a single tumor and the height of that bar represents the z-score of that tumor's CYT compared with the rest of the cohort. Bars are colored according to their PD-L1/2 amplification status and sorted from high zscores to lowest.     Integrative clustering of 178 Core Set cervical cancer cases using mRNA, methylation, miRNA, and copy number (CNV) data identifies two squamous carcinoma-enriched groups (Keratin-low and Keratin-high) and one adenocarcinoma-enriched group as shown in the feature bars. Features presented include histology, HPV clade, HPV integration status, UCEC-like status, APOBEC mutagenesis level, mRNA EMT score, tumor purity, and three SMGs that are significantly associated across the three iClusters (ERBB2 is presented for comparison purposes with its family member ERBB3). The cluster of cluster panel displays subtypes defined independently by mRNA, miRNA, methylation, reverse phase protein array (RPPA), CNV, and PARADIGM data. Black indicates that the sample is not represented in the cluster, red indicates that the sample is represented in the cluster, and gray represents data not available. The bottom heatmap panel shows select mRNAs, miRNAs, proteins, and CNVs that are either significantly associated with iCluster groups or identified as markers in other analyses. The heatmap color scale bar represents the scale for the features presented in the heatmap panel with a breakpoint of zero represented by white. APOBEC Mut., APOBEC Mutagenesis; inter., intermediate.   1a). Clusters presented from left to right include Hormone (dark blue), EMT (red), and PI3K/AKT (green). A subset of proteins differentially expressed between the clusters is highlighted. Clinical and molecular feature tracks are shown for those features which were significantly associated with RPPA clusters (p<0.05). Correlation between RPPA clusters and other categorical variables were detected by Chi-Squared test, while correlations with continuous variables were examined using the non-parametric Kruskal-Wallis test. In the heatmap blue color represents downregulated expression, red represents upregulated expression, and white represents no change in expression. NA represents data not available. b, Five-year Kaplan-Meier survival curves and log-rank test's p-value comparing overall survival (OS) across all RPPA clusters using 115 Silhouette Width Core samples (Silhouette Core; see Supplemental Information S8). c, EMT mRNA score levels were calculated for all samples and compared across RPPA clusters. A significant p-value is presented for a one-way ANOVA analysis. d, Pathway scores for EMT, hormone receptor, and PI3K/AKT signaling pathways are presented for all RPPA clusters  . HPV integration and differential pathway activation between HPV subtypes a, Cytoscape display of the largest interconnected regulatory network of PARADIGM integrated pathway level (IPL) features showing differential inferred activation between HPV A9 and A7 squamous carcinomas (n= 101 and n=35, respectively). Node color and intensity reflect the level of differential activation. Node size represents level of significance. Regulatory nodes with at least 5 downstream targets are highlighted in bold text. SFN is within a subnetwork identified by Functional Epigenetic Module (FEM) analysis (Supplemental Information S13) as disrupted between HPV A9 and A7 squamous cell carcinomas, and is highlighted using a bold black outline. b, Circos plot showing frequency (0-100%) of gains and losses for regions of each chromosome (outer circle). Lines within inner circle indicate integration breakpoints from the HPV genome to the human genome as defined in Methods, Supplemental Information S2, and Supplemental Table 3. Lines are color coded by HPV clade. Page 45 Nature. Author manuscript; available in PMC 2017 July 23.