Integrative Genomics and Transcriptomics Analysis Reveals Potential Mechanisms for Favorable Prognosis of Patients with HPV-Positive Head and Neck Carcinomas

Patients with HPV-positive head neck squamous cell carcinomas (HNSCC) usually have a better prognosis than the HPV-negative cases while the underlying mechanism remains far from being well understood. We investigated this issue by an integrative analysis of clinically-annotated multi-omics HNSCC data released by the Cancer Genome Atlas. As confirmatory results, we found: (1) Co-occurrence of mutant TP53 and HPV infection was rare; (2) Regardless of HPV status, HNSCCs of wild-type TP53 implied a good survival chance for patients and had fewer genome-wide somatic mutations than those with a mutation burden on the gene. Our analysis further led to some novel observations. They included: (1) The genes involved in “DNA mismatch repair” pathway were up-regulated in HPV-positive tumors compared to normal tissue samples and HPV-negative cases, and thus constituted a strong predictive signature for the identification of HPV infection; (2) HPV infection could disrupt some regulatory miRNA-mRNA correlations operational in the HPV-negative tumors. In light of these results, we proposed a hypothesis for the favorable clinical outcomes of HPV-positive HNSCC patients. That is, the replication of HPV genome and/or its invasion into the genomes of cancer cells may enhance DNA repair mechanisms, which in turn limit the accumulation of lethal somatic mutations.


Results
Cancer patient stratification. According to HPV infection (i.e. positive or negative) and the status (i.e. wild or mutant) of genes TP53 and CDKN2A, we stratified the 296 HNSCCs, accompanied by complete clinical and somatic mutation data, into five subtypes. The first subtype, namely tp53&cdkn2a-mut_HPV− (A1), contained HPV-negative tumors with a mutation burden on both TP53 and CDKN2A. Other four subtypes were similarly defined. They include tp53-mut_HPV− (A2), tp53-mut_HPV+ (B), tp53-wild_HPV− (C) and tp53-wild_HPV+ (D). As shown in Table 1, the co-occurrence of mutant TP53 and (positive) HPV infection was rare (n = 2), and a mutant CDKN2A was present only in the samples with a mutation burden on TP53. As subtype "B" contained only two samples, we didn't further consider it in the subsequent analysis. It is worth noting that silent mutations were excluded from our analysis, implying that the tumor stratifications based on the genotypes  Table 1. Sample profiles and stratified subtypes. a In the dataset, all normal tissue samples have mRNAseq and mRNA-seq information, and tumor samples have somatic mutation, mRNA-seq and miRNA-seq information. In our analysis, the mutation set (N = 296) contains 275 HNSCCs that were also involved in the TCGA study (N = 279) (2015) 19 . Among these 275 common samples, only one has different HPV-status between our analysis and the TCGA study. In determining the final HPV-status of a HNSCC, TCGA considered the concordance between the RNA-seq data and other molecular and sequence information, including WGS data.
Subtype-specific survival profiles and somatic mutation spectra. Figure 1 depicts the Kaplan-Meier (K-M) survival curves for the patient groups defined by the tumor stratification. Log-rank test and Cox-PH regression analysis demonstrate that, as a whole, the association between the patient survival and cancer subtypes is statistically significant before and after the initial diagnosis ages are corrected (p < 0.001). This association is primarily due to the difference between the aggregate of subtypes A1 and A2 and the aggregate of subtypes C and D. This result implies that, regardless of the HPV infection status, the patients with wild-type TP53 have better prognosis than the patients with a mutation burden on the gene. The impact of HPV infection on patient survival is demonstrated by the comparison of the K-M curves of subtypes C and D. That is, compared to subtype C, subtype D has a higher 3-year survival rate but its survival advantage is not maintained in that all patients in subtype D were deceased by year 8 while the 8-year survival rate of subtype C is over 0.5. The "double mutant" subtype A1 has a better prognosis than the "single mutant" subtype A2. Their survival curves begin to diverge at the 18 month time point and the difference is marginally significant (p < 0.07). To the best of our knowledge, this is the first work studying the relationship between CDKN2A mutation and the survival of HNSCC patients. The potential scientific merit of this observation is its biological implication (see Discussion section) rather than the prognostic value. This is because the effect of TP53 status is so drastic that it completely overrides the status of CDKN2A. We analyzed the subtype-specific somatic mutation spectra by fitting the empirical cumulative distributions of mutations present on cancer samples. The plots displayed in Fig. 2 are based on three gene sets (catalogues). The first includes all the HUGO genes whose official symbols have been approved by Human Genome Organization. The second contains 435 "cancer driver" genes identified by a pan-Cancer project using the MutSig software 30,31 . The third consists of 506 cancer genes collected in the COSMIC (Catalogue of Somatic Mutations in Cancer) database 32 . We found that, similar to the case of patient survival, the major stratification factor for the number of mutations present in a tumor was the genotype of TP53 in cancer cells. That is, the samples with wild-type TP53, especially those infected by HPVs (e.g. subtype D), have fewer somatic mutations than the samples with a mutation burden on TP53 (e.g. subtypes A1 and A2). While over 30% of samples in subtype C are enriched with mutations on the MutSig cancer diver genes and COSMIC cancer genes, its mutation spectrum in the HUGO genes is similar to that of subtype D as ~20% of the samples have only 10-50 mutations. Most (~ 90%) of HPV-positive tumors have at least one mutation on the cancer genes, indicating that the progression of the cancer initiated by virus infection is driven by somatic mutations, which is similar to the cancer initialized by other carcinogens. By a set of Mann-Whitney tests, we found that the differences in mutation burden are significant (p < 0.01) between subtype D and subtype A1 (A2) with respect to all the three gene catalogues.

Subtype-specific gene expression alterations.
To investigate the tumor subtype-specific gene expression alterations in HNSCCs, we performed seven Mann-Whitney tests (or comparisons) on each gene, using the information of the "expression set" in Table 1. Specifically, tests "CTR-A1N", "CTR-A2N", "CTR-CN" and "CTR-DN" compared the four major subtypes (i.e. A1, A2, C and D) with normal tissue samples, respectively. Test "CTR-CD" compared subtype C to subtype D to identify the genes whose expression is impacted by HPV infection. Test "CTR-A2C" compared subtype A2 to subtype C to pinpoint the genes whose expression is associated with the genotypes of TP53. Test "CTR-A1A2" compared subtype A1 to subtype A2 to identify the genes whose expression is associated with the genotypes of CDKN2A. We didn't further analyze the results from the last two comparisons because (1) the 50 significant genes identified from test "CTR-A2C" lack functional similarity; and (2) for test "CTR-A1A2", in which only one gene satisfies the adopted significance criterion, i.e. the expression of CDKN2B is up-regulated in subtype A1.
For each of the other five comparisons, we performed further analyses by the following procedures. We first scanned 16000 genes that have an expression (RPKM > 2) in at least half of the tumor and normal tissue samples, obtaining the p-values and fold changes (FC) for the between-group differences. Then, we adjusted the p-values by the Benjamini-Hochberg (BH) method, and identified the differentially expressed genes by the criteria of adj.p < 0.01 and FC > t (t was set to be 2.0 for comparing tumors and normal tissue samples and 1.5 for comparing two tumor subtypes). In this way, we generated five subsets of significant genes with the sizes between 2609 and 3056 (Supplementary Table S1). Finally, we performed functional enrichment analysis on each gene subset using the DAVID tool 33 .
We organized the 17 significant KEGG pathways 34 , which are over-represented (BH adj.p < 0.01) by at least one of the five gene subsets, into four pathway clusters (Fig. 3). Cluster-1 (red) includes "Focal adhesion", "Cytokine-cytokine receptor interaction" and "ECM-receptor interaction". They are common to all the comparisons between normal tissues and the four tumor subtypes, outlining the functional implications of the gene expression alterations in head and neck cancer cells. Cluster-2 (purple) includes "tp53 signaling pathways" and two DNA processing mechanisms, namely "Mismatch repair" and "Base excision repair". They are unique to the comparison between HPV-positive and HPV-negative tumors and are our primary focus for further study in the next paragraph. Cluster-3 (blue) consists of "Cell adhesion molecules" and four immunity-relevant pathways such as "Primary immunodeficiency", which characterize the virus infected tumors intuitively. Cluster-4 (green), shared by tests "CTR-DN" and "CRT-CD", demonstrates the interference of HPV infection in cell cycle (a cancer hallmark) and DNA replication (directly related to the duplication of virus). As to this cluster, a major remaining question is why the expression alterations of cell cycle genes, as a cancer signature, were observed only in HPV-positive tumors.
We further closely examined the tumor subtype-specific expression profiles of member genes involved in the three pathways in Cluster-2. The scrutiny demonstrated that DNA mismatch repair (MMR) 35 had a clear relationship with the HPV status of cancer cells. Among the 23 genes annotated to MMR, RPA4 was hardly expressed in the analyzed samples and therefore was excluded from the analysis hereafter. We treated the remaining 22 genes as the "operational" MMR genes in the normal and cancer cells of head and neck tissues, and depicted their tumor subtype-specific boxplots and Mann-Whitney test results in Fig. 4. For reference purposes, we also displayed the corresponding results for CDKN2A and TP53 in the same figure. We found that over 50% of these MMR genes had a consistent transcription pattern with the expression levels in the following order, HPV-positive tumors > HPV-negative tumors > normal tissue. In particular, the transcriptionally altered genes were involved in all major steps of MMR, namely mismatch recognition, the excision of mismatched DNA, and DNA re-synthesis and ligation (See the legend of Fig. 4 for details). Based on these observations, we provided an explanation for the relationship between the subtype-specific survival profiles and the mutation spectra in the Discussion section. To study whether the 22 operational MMR genes can constitute a strong signature for the identification of HPV infection, we performed a Singular Value Decomposition (SVD) on the transpose of the row-centralized expression matrix of these genes for the 296 HNSCCs. By integrating the first and second left SVD vectors (u 1 and u 2 ), we calculated a score vector w, with w i = f (u 1i , u 2i ) being the value for the i th element (sample), to separate the HPV-positive tumors from the other types (See Methods section). Using the receiver operating characteristic (ROC) analysis, we compared the predictive strength of the derived feature with that of u 1 , u 2 or the expression of CDKN2A, a well-known biomarker for HPV infection. In the implementation, sensitivities and specificities were tabulated at the different possible thresholds of a diagnostic test. As shown in Fig. 5, the proposed method significantly outperformed the use of u 1 , u 2 or CDKN2A expression alone in terms of AUC (Area Under the Curve). In particular, we achieved a classification result with both sensitivity and specificity over 0.92.
Validation of MMR prognostic signature. To validate the identified MMR transcriptomic signature, we performed a hierarchical clustering analysis on the GEO microarray dataset GSE3292 21 . As shown in Supplementary Figure S1, the 36 HNSCCs could be largely partitioned into two clusters and a scalar. The eight HPV-positive tumors were exclusively grouped together. The pattern of enhanced expression of MMR genes in HPV-positive tumors was clear.

Subtype-specific miRNA-mRNA interactions.
In inferring the subtype-specific miRNA-mRNA interactions, we focused on 131 miRNAs and 4875 mRNAs that show significant transcription alterations (adj.p < 0.01 and FC > 2) in at least one tumor subtype compared to the normal tissue. The transcriptional correlations (connections) between miRNAs and mRNAs were calculated by the Pearson coefficient and the significance levels were evaluated by a t-test. Based on the correlations, miRNA:mRNA interaction modules were identified by the method presented in the Material and Methods section.
As summarized in Fig. 6 and Supplementary Table S2, we identified 5 or 6 miRNA-mRNA module pairs (MPs) for each tumor subtype. Each MP included one positive-connection module and one negative-connection module. The number of miRNAs or mRNAs (genes) in each module varied from 2 to 17 or from 0 to 322. Most of the modules contain transcription factor (TF) genes that may take roles as the mediator for the miRNA-mRNA connections. The two members (such as tp53-mut_HPV− /modu-I-ne and tp53-mut_HPV− /modu-I-ps) of a MP have the same miRNAs but different mRNAs. Two modules of distinct MPs (such as tp53-mut_HPV− /modu-I-ps and tp53-mut_HPV− /modu-II-ps) consisted of different miRNAs and varied (or partially overlapped) mRNA sets. Within a positive or negative connection module (indicated by a "-ps" or "-ne" extension in the IDs), the miRNA-mRNA correlations at the expression levels were consistently positive or negative. Regardless of the connection type, the mRNAs (or miRNAs) in each module naturally represent a co-expressed gene cluster.
In the literature, it was reported that miRNAs -150 and -155 control B and T cell differentiation 36 . Here, we note that, among the positive connection modules of each tumor subtype, the one containing these two miRNAs was most significant in that the number of the modular genes was the largest and the paired negative-connection module was empty or nearly empty. Functional enrichment analysis (Supplementary Table S3) demonstrated that multiple immunity-related GO terms and KEGG pathways were over-represented by the member genes. The HPV-positive tumors differentiated from others in that multiple miRNAs (miRNA-148b,-29c,-625, and -766), along with miRNA-150 and -155, were present in the modules of subtype D. The subset of the modular genes was largely overlapped with another positive-connection module defined by let-7c and miRNA-99a. These observations implied that miRNAs were more widely involved in immunity in HPV-positive tumors than in the tumors of other subtypes.
Among the four focused tumor subtypes, A2 had the largest sample size (N = 128), which would lead to a high statistical power in the network analysis. For this subtype, we identified two semi-canonical regulatory modules (semi-CRMs) (Fig. 7, top row), in which the 3′ UTR sequences of the involved mRNAs were enriched with the target site motifs of the modular miRNAs. Several gene ontology (GO) terms and KEGG pathways, including GO:0031012~extracellular matrix, GO:0007155~cell adhesion, hsa04512:ECM-receptor interaction and others, were over-represented by the member genes. These two semi-CRMs had approximate counterparts among the negative-connection modules of subtypes A1 and C (Fig. 7, middle row). However, neither major negative-connection modules of subtype D met the minimal requirement for a semi-CRM (Fig. 7, bottom row). These results indicated that HPV infection could disrupt some regulatory miRNA-mRNA relationships observed in the HPV-negative tumors.

Discussion
Patients with HPV-positive HNSCCs have a good prognosis but the underlying mechanism remains unclear. As the major conclusion of this integrative genomics and transcriptomics analysis, we proposed a corollary hypothesis for the favorable relationship between HPV infection and patient survival. That is, the replication of HPV sequences and/or the invasion into the genomes of cancer cells may enhance the DNA repair mechanisms, which in turn limit the accumulation of lethal somatic mutations. This hypothesis is equivalent to a heuristic model describing the potential carcinogenesis of HNSCCs and the genetically defined progressive relationships between different tumor subtypes (Fig. 8). The supporting evidences include several observations (OBSs) derived from the profile of tumor subtype-specific genomic and phenotypic features.

OBS-I.
Co-occurrence of mutant TP53 and HPV infection is rare in HNSCCs. This observation confirmed the result reported by a recent TCGA publication 19 . We also notice that Smith et al. 37 made a similar statement 37 . However, their claim was not sufficiently supported by the cited evidences 38,39 . In fact, this pattern might be masked due to the relatively weak predictive power of the utilized methods. For example, several previous studies employed the expression of p16 protein and the presence of HPV DNA (detected by a PCR-Based Mass Spectrometry System) as surrogate markers for oncogenic HPV infection [40][41][42] . However, these biomarkers (or predictive signatures) cannot guarantee high discovery specificity. As to the TCGA data focused in this study, the TP53-mut_HPV+ group contains only two samples. If the DNA presence-based technique, rather than a virus expression based method, was used to call HPV status, 22 HNSCCs would be partitioned into this subtype ("nationwidechildrens.org_clinical_patient_hnsc.txt", a TCGA dataset downloaded on 04/25/2014). As a result, we would at most conclude that HPV infection tends to be present in wild type TP53 tumors. It is worth noting that, at present, the most robust method to detect both the presence of viral DNA and potential viral integrations, either exonic/intronic or intergenic, could be the high-pass whole-genome sequencing (WGS). TCGA measured 29 HNSCC samples with this technique. However, the method is still too expensive for a wide range of clinical applications.

OBS-II.
Regardless of HPV status, HNSCCs of wild-type TP53 imply a good survival chance for patients and have fewer genome-wide somatic mutations than those with a mutation burden on the gene. TCGA performed a more specific comparison of the mutation spectra of HPV-positive and HPV-negative tumors 19 . This phenomenon, in combination with OBS-I, indicates that the survival advantage of patients with HPV induced tumors may be due to the lack of TP53 mutation and/or low incidences of other lethal mutations. The poor association of TP53 mutation with the clinical outcome of patients has been widely reported 20 . However, this issue is complicated by the fact that a CDKN2A mutation is only observed in mutant TP53 HNSCCs and the presence improves the survival of the patients. It seems that the mutated TP53 and p16 proteins exert a very strong epistatic effect on the formation of HNSCCs but the resulting growth advantage of cancer cells is not maintained in cancer metastasis and/or the resistance to treatment therapy. In fact, the effect of a mutation on cancer cells could be positive, neutral, or negative, depending on the microenvironment and cancer progression stage [43][44][45] .

OBS-III.
The genes involved in mismatch repair (MMR) pathway were up-regulated in HPV-positive tumors compared to both HPV-negative tumors and normal tissue samples. This observation suggests a potential mechanism for the association between the mutation spectra of HNSCCs and the HPV infection status. That is, the enhanced MMR may limit the accumulation of somatic mutations as well as the occurrence of TP53 mutation in HPV-positive tumors (see next paragraph for more discussion). There are some causal factors that may be responsible for the high expression levels of MMR genes. First, the integration of virus nucleotides into the genomes of host cells could stimulate the DNA repair pathways (see Supplementary Text 1 for a note). Actually, base excision repair (BER) and nucleotide excision repair (NER) pathways, another two mechanisms for maintaining genome stability 46 , were also over-represented (adj.p < 0.025) by the significant genes (most of them are up-regulated in HPV-positive tumors) identified in the comparison of subtypes C and D. Second, replication of the HPV genome depends on the host-cell DNA replication machinery 47 and some proteins required in the replication process play roles in MMR as well 48 . Third, the up-regulated expression of TP53 in high-stage HPV-positive tumors potentially compromises the E6/E7 (oncogenetic viral proteins) induced p53 degradation, maintaining the level of the cancer suppressor protein to activate the transcription of downstream DNA repair genes. This is unlike the situation in most HPV-negative tumors (e.g. subtypes A1 and A2) whose mutated p53 protein could lose the normal functions. This perception is supported by the result of an additional analysis of the proteomic data recently published by TCGA, that is, the HPV-positive tumors had lower TP53-R-V levels compared to HPV-negative samples but the difference was not significant (p > 0.05). In the biological inference presented above, the hypothesis that the enhanced expression of MMR genes may limit the accumulation of somatic mutations in HNSCCs still warrants further validation, since we lack proteomic data to confirm whether the high expression levels of mismatch repair genes in HPV-positive samples can truly enhance the proteins coded by them. However, previous studies provided some indirect evidences for the assumption. For example, Hsieh and Yamane 49 proposed that a hallmark of many MMR-deficient cells was instable at microsatellite regions consisting of mono-and di-nucleotide repeats 49 . Based on the model of C. elegans, Denver et al. 50 found that, compared to the wild-type population, DNA repair-deficient lines with mutant gene(s) in MMR, BER and NER had higher genome-wide mutation rates, in both base substitutions and indels 50 . Their study also showed a hierarchy in the relative importance of the three repair pathways in maintaining genome stability: MMR over NER over BER. Criss et al. 51 reported that mismatch correction modulated mutation frequency and pilus phase and antigenic variation in Neisseria gonorrhoeae 51 . Tomasetti et al. (2015) showed that the average somatic mutation rate in MLH1-normal colorectal cancers (CRCs) was 72% higher than that in MLH1-silent CRCs 52 .
The involvement of MMR proteins in HNSCCs has been studied recently. Theocharis et al. showed that the expression of the mismatch repair proteins was increased (55.1% in MSH2, 36.73% in MLH1) in tongue squamous cell carcinoma and was significantly associated with disease-free patients' survival 53 . Pereira et al. reported that low expression of MSH2 DNA repair protein in HNSCCs was associated with poor prognosis 54 . These results are largely consistent with our observations and speculations. That is, HPV-positive tumors, as a whole, had high expression levels of MMR genes, and the patients with this type of cancer had good clinical outcomes.
A reviewer of this paper had an interesting question about the MMR signature. That is, is there an enhanced proliferative potential of HPV-positive HNSCCs, where the MMR involvement will be collateral as being the integral part of the post-replication repair? We addressed this issue by scrutinizing the genes differentially expressed between HPV-positive tumors and HPV-negative tumors. We found that, among the ~2700 significant genes (Supplementary Table S1) identified in the comparison of "CTR-CD", 86 genes (subset-1) had been annotated to the Gene Ontology biological procession term "cell proliferation" (GO: 0008283) by Jan 29, 2016. In subset-1, only 41 genes (subset-2) were upregulated in HPV-positive samples compared to the HPV-negative counterparts. In this regard, it is hard to state that HPV-infected tumors are more proliferative than the others. Nevertheless, this possibility can't be excluded yet. The reason is that subset-1 does contain genes MCM2 and KI676, which have been reported as proliferation markers 55 and have no relation to DNA mismatch repair.
The model depicted in Fig. 8 was motivated by the mutation spectra of the entire HNSCC set. Similar to Weinberger et al. 40 , we displayed two potential paths for the generation of HPV-positive HNSCCs, namely tumorigenesis directly induced by HPV or initiated by other causal factors before virus infection. We further assumed that partial TP53 mutations could occur in the progression of established tumors. This proposition was supported by the similarity of gene expression profiles observed among the HPV-negative HNSCCs. The TP53/CDKN2A double-mutant tumors were considered as a special subtype (A1) due to its unique patient survival curve. CDKN2B, adjacent to CDKN2A in the human genome, was up-regulated in the tumors with mutant CDKN2A, and was the only differentially expressed gene in the comparison of "CTR-A1A2".
In the Results section, we showed that the expression profiling of MMR genes can be used to distinguish HPV-positive HNSCCs from other tumors. The MMR signature was further validated by analyzing an independent microarray gene expression dataset. The strength of the identified signature using the proposed SVD method, in reference to the gene expression of CDKN2A alone, was well demonstrated by the obtained predictive sensitivity, specificity as well as AUC score. By an additional analysis on 172 TCGA HNSCC samples with both RNA-seq and proteomic data, we further found that the highest specificity achieved by the protein p16_INKa-R-C based prediction is 0.89 to retain a sensitivity of 0.9. This result is inferior to that obtained using the proposed method. For the subtype prediction of a new test sample t, u 1t (u 2t ), which are required for computing the discriminative score w t , can be obtained by adjusting its gene expression vector (i.e. subtracting the means of the training set from the original values) and then projecting the vector onto the product of the first (second) right singular vector and the reciprocal of the first (second) singular value of the training set-based SVD.
Recently, Parfenov et al. 17 identified a methylation signature of 774 probes to distinguish HPV-positive HNSCCs from HPV-negative counterparts. We further evaluated this signature by replacing our MMR signature with it in re-performing the SVD and ROC analysis. The result showed that the predictive strength of this methylation signature was even inferior to that of the CDKN2A transcriptomic signature (Supplementary Figure S2).
Numerous studies have shown that aberrantly expressed miRNAs are likely to contribute to human diseases, including cancer. It has been recognized that the interference of miRNAs with tumorigenesis is quite complicated and needs to be scrutinized by the network-based systems biology approaches 56,57 . To our best knowledge, miRNA-mRNA network analysis has not been reported in head and neck cancer yet. In this paper, we initiated such a study, focusing on (tumor) subtype-specific miRNA-mRNA interactions. The results show that miRNAs are more widely involved in immunity in HPV-positive tumors than in the tumors of other subtypes in the sense that more miRNAs demonstrate modularized co-expression relationships with the genes playing roles in immune response in HPV-positive tumors. The regulatory network analysis also suggests that HPV infection may disrupt some regulatory miRNA-mRNA relationships observed in the HPV-negative tumors. It remains unclear if the disrupted regulation of miRNAs on mRNAs in cancer cells contributes to the desired survival of HPV-positive tumor patients by exerting deleterious impacts on cancer cells. Meanwhile, the observations inspire us to ponder whether non-coding HPV RNAs may serve as molecular decoys to sequester miRNAs from their target mRNAs or promote the degradation of cellular miRNAs. In this regard, it is worth noting that an evidence for virus-sourced cellular miRNA sponges was recently reported 58,59 . The mutation calls in whole exome sequencing (WES) were validated by TCGA using a PCR-based method. The estimated false positive rate was ~ 5% 19 . The expression data have been normalized and summarized by TCGA using the standard method. We performed logarithm transformation of the expression levels before the statistical analysis.

Material and Methods
We edited the clinical data by removing the tumors not covered by Tang et al. 8 . Three tumors, whose HPV-status is "intermediate" or progression-stage is "unavailable", were also filtered. For the somatic mutation dataset, the DNA variants in the "RNA" and "Silent" categories were excluded from analysis. HPV status. The HPV status was defined according to the results obtained Tang et al. 8 . Specifically, the authors quantified viral mRNA (based on TCGA RNA-seq data) by computing the fraction of viral reads (FVR), represented as parts per million (ppm) of total library size. We considered a tumor to be HPV-positive if the FVR for any one strain of HPV was larger than 0.5 ppm. It is worth noting that the HPV status in the TCGA file of "nationwidechildrens.org_clinical_patient_hnsc.txt" was determined by a PCR and Sequenom Based Mass Spectrometry System 42,60 .

Independent microarray data. A microarray dataset (GSE3292) was downloaded from Gene Expression
Omnibus (GEO) 21,61 . In this dataset, the gene expression profiling of 28 HPV-negative HNSCCs and 8 HPV-positive HNSCCs were measured by Affymetrix Human Genome U133 Plus 2.0 Arrays. The authors of data employed quantile normalization and logarithm transformation to preprocess gene expression levels, which were estimated from the raw data by the perfect match algorithm. In the Series Matrix File, most genes are measured by two or multiple probe sets. Before the analysis, we combined the expression levels of different Affymetrix IDs for the same gene within an individual sample by calculating the average. Survival analysis. We performed the survival analysis using the statistical functions in the R package "survival" 62 . For a univariate survival analysis with the cancer subtype as the predictor, the function "survdiff " was employed to generate the Log-rank test p-value. The Kaplan-Meier survival curves in Fig. 1, with the censored observations being marked by a vertical tick, were obtained by the function "survfit". A multivariate survival analysis, with "tumor subtype" and "age at initial diagnosis" as the predictors, was conducted using the function "coxph" that implements the Cox Proportional Hazards regression.

SVD implementation.
Suppose M p×q is the transpose of the row centralized gene expression matrix for q genes on p tumors, the singular value decomposition (SVD) is a factorization in the form of M = UDV′ 63 . In the decomposition, the p × q left factor matrix U has orthogonal columns, the q × q diagonal matrix D = diag(d 1 , … , d n ) contains positive singular values ordered as d i ≥ d 2 ≥ … d n ≥ 0, and q × q right factor matrix V′ has orthogonal rows and columns. Corresponding to the first two principal component vectors of matrix M, the leading left SVD vectors u 1 and u 2 can be used to partition the tumor samples into gene expression-related groups. In this paper, we proposed an ad hoc method to integrate them for the identification of HPV-positive tumors using the expression profiling of genes involved in mismatch repair pathway. We calculated a score vector w, with w i = f (u 1i , u 2i ) being the value for the i th element (sample). The sample was classified as HPV-positive if w i was greater than the threshold. The formula for calculating w i was defined as follows.
Identification of miRNA-mRNA Modules. We employed an algorithm similar to the one used in our previous study 57 . We organized the miRNA-mRNA correlations in the form of a matrix with mRNAs in rows and miRNAs in columns. According to the signs (positive or negative), we filled the matrix with 1 or − 1 for the top 2% of absolute correlations and 0 for the remaining elements. Using the matrix as the input, we generated a heatmap by applying the function "heatmap.2" in the R package "gplots". The layout of miRNAs and mRNAs in the heatmaps was based on a two-way hierarchical clustering analysis with Manhattan distance and Ward method as the arguments. We identified the miRNA-mRNA modules by the following steps. (1) Based on the dendrogram and the miRNA-mRNA connection patterns shown on the heatmap, several modular miRNA subsets (clusters) were visually determined; (2) For each of the miRNA subsets, the positive or negative connections with mRNAs were collected into several 2-column topology matrices, respectively; and (3) A miRNA-mRNA module pair was identified from the outputs of step (2) after dropping the mRNAs with only one (positive or negative) connection.
Target site enrichment test. Using a lab-owned R program, we identified the 7-mer and 8-mer miRNA target site motifs on the 3′ UTR sequences of the genes measured in the employed data. The binary miRNA-mRNA sequence affinity matrix (A) was then generated in a way such that an element (A ij ) of value 1 indicated the existence of target site motif(s) for the j th miRNA in the 3′ UTR sequence of the i th mRNA. For a miRNA, the statistical significance of the target site enrichment level in the list of the correlated modular genes was measured by the Fisher's exact test in reference to the level of the entire gene set.