The landscape of viral associations in human cancers

Potential viral pathogens were systematically investigated in the whole-genome and transcriptome sequencing of 2384 donors across Pan-Cancer Analysis of Whole Genomes using a consensus approach integrating three independent pathogen detection pipelines. Viruses were detected in 485 genomic and 70 transcriptome data sets. Besides confirming the prevalence of known tumor associated viruses such as EBV, HBV and several HPV types, numerous novel features were discovered. A strong association was observed between HPV infection and the APOBEC mutational signatures, suggesting the role of impaired mechanism of antiviral cellular defense as a driving force in the development of cervical, bladder and head neck carcinoma. Viral integration into the host genome was observed for HBV, HPV16, HPV18 and AAV2 and associated with a local increase in copy number variations. The recurrent viral integrations at the TERT promoter were coupled to high telomerase expression uncovering a further mechanism to activate this tumor driving process. Most importantly, our systematic analysis revealed a novel association between mastadenovirus and several tumor entities. In renal cancer, mastadenovirus presence was significantly exclusive with well-known driver mutations in kidney cancer and defined a patient subgroup with better survival. Independent from mastadenovirus presence, high levels of endogenous retrovirus ERV1 expression is linked to worse survival outcome in kidney cancer.


Abstract
Potential viral pathogens were systematically investigated in the whole-genome and transcriptome sequencing of 2384 donors across Pan-Cancer Analysis of Whole Genomes using a consensus approach integrating three independent pathogen detection pipelines.
Viruses were detected in 485 genomic and 70 transcriptome data sets. Besides confirming the prevalence of known tumor associated viruses such as EBV, HBV and several HPV types, numerous novel features were discovered. A strong association was observed between HPV infection and the APOBEC mutational signatures, suggesting the role of impaired mechanism of antiviral cellular defense as a driving force in the development of cervical, bladder and head neck carcinoma. Viral integration into the host genome was observed for HBV, HPV16, HPV18 and AAV2 and associated with a local increase in copy number variations. The recurrent viral integrations at the TERT promoter were coupled to high telomerase expression uncovering a further mechanism to activate this tumor driving process. Most importantly, our systematic analysis revealed a novel association between mastadenovirus and several tumor entities. In renal cancer, mastadenovirus presence was significantly exclusive with well-known driver mutations in kidney cancer and defined a patient subgroup with better survival. Independent from mastadenovirus presence, high levels of endogenous retrovirus ERV1 expression is linked to worse survival outcome in kidney cancer.    Table, sheet "Candidate Reads"). 182,5 billion sequences were considered for further analysis as they were not sufficiently aligned to the human reference genome in the PCAWG-generated alignment (see Materials and Methods). Remaining reads ranged from 2,000 to 800 million per WGS tumor sample and 2.9 million to 120 million per RNA-seq tumor sample ( Figure 1A, Supplementary Figure 1A, B for WGS non tumor and RNAseq reads). Viral sequences were detected and quantified independently by three recently developed pathogen discovery pipelines CaPSID, P-DiP and SEPATH (see supplementary methods). The estimated relative abundance of a virus was calculated as viral reads per million extracted reads (PMER) at the genus level to improve data consistency between pipelines. To minimize the rate of false positive scores in virus detection, we applied a strict threshold of PMER>1 supported by at least three viral reads as similarly suggested by previous studies 12,13 . If a viral genus was identified by at least two of the three pipelines, it was considered present in the sample. The overlapping search spaces of the three pipelines revealed a total of 532 genera present in at least two of the pipelines (Supplementary Figure   1C). Filtering of laboratory contaminants of viral sequences was achieved through P-DiP, by examining each assembled contig of viral sequence segments for artificial, non-viral vector sequences (see Materials and Methods). The most frequent hits prone to contamination were for lambdavirus, alphabaculovirus, microvirus, simplexvirus, hepacivirus, cytomegalovirus, orthopoxvirus and punalikevirus ( Figure 1B), and these were observed across many tumor types.
We generally observed a strong overlap of the genera identified across pipelines (Supplementary Figure 1D). From the whole genome dataset, we identified 286, 887 and 215 virus-tumor pairs for P-DiP, CaPSID and SEPATH, respectively ( Figure 2A). Notably, there was no difference in the PMER distribution of common hits across the three pipelines indicating that a common detection cut-off is reasonable (Supplementary Figure 2B). The number of hits derived from the RNA-seq dataset decreases to 112, 83, 42 virus-tumor pairs for P-DiP, CaPSID and SEPATH, respectively ( Figure 2B). SEPATH, using a k-mer approach, detected the lowest number of virus hits and was the least sensitive. Despite this, the identified viruses matched well with the consensus (DNA 90%, RNA 95%). P-DiP, based on an assembly and BLAST approach detected more hits with 80% of the DNA and 54% of the RNA hits in the consensus set, while CaPSID, being most sensitive, implementing a two-step alignment process complemented by an assembly step, identified 53% (DNA) and 80% (RNA) hits within the consensus set. While the majority of the virus hits from RNA-seq (n=61/67) were overlapping with the WGS data, the reverse is not true, emphasizing the importance of DNA sequencing for generating an unbiased catalogue of tumor-associated viruses. This difference can also be attributed to the viral life cycle as during incubation or latent phases, viral gene expression can be minimal 14 . 89% of the sequence hits detected from WGS and RNA-seq data were found to be from the virus genome type of double-stranded DNA virus (dsDNA) and dsDNA with reverse transcriptase ( Figure 1C). This could be attributed to i) a higher frequency of tumor-associated viruses from these genome types 15 , ii) a larger sequence dataset for WGS in comparison to RNA-seq, iii) a potential limitation of our analysis due to DNA and RNA extraction protocols that are less likely to include ssDNA or RNA viruses ( Figure 1C).

The virome landscape across 39 distinct tumor types
We employed a consensus approach that resulted in a reliable set of 491 distinct virus-tumor pairs from WGS and RNA-seq data (Figure 2A  Interestingly, we did not identify a significant enrichment of co-infection of multiple viruses in any tumor type (Suppl. Figure 2C).

Alphapapillomavirus
Alphapapillomaviruses were mainly detected in head and neck cancer (n=18 out of 57), cervical cancer (n=19 of 20) and in two bladder cancer cases out of 23, in agreement with previous studies 5,19,20 . There is also supporting evidence for 32 out of 39 alphapapillomavirus hits in the transcriptome data ( Figure 2C). We observed only one HPV subtype per tumor according to the P-DIP results. At the subtype level, HPV16 was found to be the dominant type in cervix (n=11) and head and neck (n=15) tumors, followed by HPV18 only present in cervical cancer (n=6). As reported previously 13 , HPV33 was identified both in head and neck (n=3) and cervix (n=1) tumor samples. On the other hand, different HPV variants, type 6 and type 45, were detected in bladder cancer.
We further characterized the functional effects of alphapapillomaviruses in tumors by integrating external PCAWG datasets such as driver mutations, mutational signatures, structural variations, gene expression profiles and patient survival. In head and neck cancer, HPV-positive tumors exhibit an almost complete mutual exclusivity with mutations in known drivers like TP53 (p=4.88e-06) ( Figure 3A), as reported previously 19 , which could be explained by a mutation independent inactivation of TP53 through the human papillomaviruses [21][22][23] .
Analyzing the mutational signatures enriched in these cases, we identified mutational signatures 2 and 13 as enriched for alphapapillomavirus positive cases ( Figure 3B) 24 . In addition, the expression of APOBEC3B is significantly higher in the virus positive cervix and head and neck cancers compared to their virus negative counterparts ( Figure 3D) 25

Mastadenovirus
Mastadenovirus infection is very common in humans and estimated to be responsible for between 2% and 5% of all respiratory infections. It is also responsible for mild respiratory, gastrointestinal and eye infections. Its protein E1B is known to interact and inactivate the p53 tumor suppressor protein 28,29 . Mastadenovirus sequences were detected in several tumor tissues with the highest rate in renal cell carcinoma, breast adenocarcinoma, prostate adenocarcinoma, head and neck carcinoma and hepatocellular carcinoma (Supplementary Table, Figure 3D). Notably, we identified a considerable diversity of detected mastadenovirus contigs, supporting again the finding of independent viral infections (rather than common lab specific contaminations) (see Supplementary Figure 3F). In renal cell carcinoma of chromophore type, 21 of 45 samples were positive for mastadenovirus (Supplementary Table,

Endogenous retroviruses
Human endogenous retroviruses (HERV) are integrated in the human DNA originating from infection of germline cells by retroviruses over millions of years 30 and contribute 2.7% of the overall sequence to over 500,000 individual sites in the human genome 31,32 . The endogenous retroviruses were identified by the three pathogen detection pipelines but filtered by CaPSID and SEPATH. In addition, an alignment-based approach was used to detect HERV sequences embedded in the human reference genome that could be missed by the pipelines focusing only on non-human reads. In this study, we quantified the expression of HERV-like LTR (long terminal repeat) retrotransposons categorized into several clades by Repbase 33 database as ERVL, ERVL-MaLR, ERV1, ERVK and ERV (Supplementary Table, sheet "HERV expression"). In comparison to the other HERV families, ERV1 shows the strongest expression on average ( Figure 4A) and ERVK the highest fraction of active loci ( Figure 4B). Analyzing the expression of HERVs based on the available RNA sequencing data, we could identify a strong expression for ERV1 in chronic lymphocytic leukemia compared to all other tumor tissues and adjacent normal tissues ( Figure 4C). Analyzing the HERV expression in relation to patient survival, we could identify a high ERV1 expression in kidney cancer linked to worse survival outcome ( Figure 4D, for other HERVs and tumor types see Supplementary Figure 4).
Comparing the ERV1 expression with mastadenovirus expression in renal cancers we identified, in line with the better survival outcome of mastadenovirus positive cases, a significantly stronger activation of ERV1 in mastadenovirus negative samples (p=0.00083, Figure 4E).

Genomic integration of viral sequences
Viral integration into the host genome has been shown to be a causal mechanism that can lead to cancer development 34 . This process has been known best for human papilloma viruses (HPVs) in cervical, head-and-neck and several other carcinomas, and for hepatitis B virus (HBV) in liver cancer 35,36 . We searched the PCAWG genome and transcriptome cohorts for integration of those viruses that were detected by the CaPSID platform using the "Virus intEgration sites through iterative Reference SEquence customization" (VERSE) algorithm 37 .
This algorithm utilizes chimeric paired-end as well as soft-clipped sequence reads to determine integration with single base-pair resolution. Detailed assessment of this algorithm  Table "Integration").
Recurrent integration events across samples were also observed in the KMT2B gene (four events). KMT2B was recently identified to be a likely cancer driver gene 38,39 . When comparing gene expression in samples with virus integration to those without, only TERT was overexpressed (fold change ³ 2.0) in two liver cancer samples ( Figure 5E). Additional genes with 20 increased expression impacted by integration events include TEKT3, CCNA2, CDK15 and THRB ( Figure 5A).
Novel genes, which are impacted by integration events and associated with cancer include: CDK15 that was found to be over-expressed in our study and reported to mediate resistance to the tumor necrosis factor-related apoptosis-inducing ligand (TRAIL) 40 ; SEMA6D identified as a potential oncogene candidate in human osteosarcoma 41 ; and CDH13 that is commonly downregulated through promoter methylation in various cancers 42 . In addition, a novel integration site located in the promoter region of ERICH1 was detected. For further details see Supplementary Table (sheet "Integration").
There was a significant association between HBV viral integrations and SCNAs ( Figure 5C).  Figure 5F). Additional integration events with at least two sites were detected in CRAT, ERBB2, FRMPD4, MAGI2, MAMLD1, SLC9A7, STX17 and TP63 genes. None of these multiple integration events were observed to recur across multiple patients ( Figure 5B).
Integration events were also observed in two different lncRNAs, the plasmacytoma variant translocation 1 gene (PVT1), which is recognized as an oncogenic lncRNA observed in multiple cancers including cervical carcinoma 43,44 , and LINC00111, the function of which is still to be determined. Expression of both genes is strongly increased in the cases with HPV16 integration (Supplementary Figure 6F). Individual HPV16 integration sites were also found in a number of other genes including known drivers of tumor pathogenesis (TP63, P3H2, ETS2, CD274 (PD-L1), ERBB2, IQGAP1) (see Supplementary Tables, sheet "Integration") and genes that were previously not known to be strong candidates for playing a role in 21 tumorigenesis (CEACAM5, CRAT, ENTPD1-AS1, FRMPD4, MAGI2, MAMLD1, UTP11,   COL6A6, RASA3, SORBS1, STX17-AS1, IFT140 and DOLPP1).
Using the merged single nucleotide variant (SNV) calls from the three mutation calling pipelines (DKFZ, Broad and Sanger) 10 , and by comparing samples with viral integration events to those without, we have found a significant increase in the number of mutations occurring within +/-10.000 bp of high-confidence viral integration sites (average number of mutations per sample = 0.41 (HPV16 +) vs 0.14 (HPV16 -), p = 0.02; paired t-test one sided -alternative greater). Interestingly the integration sites are, compared to a random genome background, enriched in close proximity (<1000bp) to common fragile sites (p = 0.0018, statistical test).
These results suggest that HPV16 integration reflects either characteristics of chromatin features that favor viral integration, such as fragile sites or regions with limited access to DNA repair complexes, or the influence of integrated HPV16 on the host genome, both in close vicinity and a long distance away from the integration site. Such a correlation was not seen for the integration sites of other viruses (see Supplementary Figure 6 A-E).
Finally, a single AAV2 integration event located in the intronic region of the cancer driver gene KMT2B 45 was detected in one liver cancer sample.

Identification of novel viral species or strains
The CaPSID pipeline, combining both the reference based and de novo assembly approach, was used to search for potentially novel virus genera or species. De novo analysis has generated 56 different contigs that have been classified into taxonomic groups at the genus level by the CSSSCL algorithm 46 . After filtering de novo contigs for their homology to known reference sequences, we have identified 36 contigs in 35 different tumor samples showing low sequence similarity (in average 61%) to any nucleotide sequence contained in the Blast database (see Materials and Methods). In this respect, our analysis has shown that WGS and RNA-seq can be used to identify novel isolates potentially from new viral species. However, the total number of novel isolates were quite low in comparison to viral hits to well-defined genera ( Figure 2B). In addition, the findings were not enriched for a specific tumor entity but rather distributed across various cancer types including bladder, head/neck and cervical cancers (alphapapillomavirus) and many more (Supplementary Figure 7). 22

Conclusions
Searching large pan-cancer genome and transcriptome data sets allowed the identification of an unexpectedly high percentage of virus associated cases (16%). In particular, analysis of tumor genomes, which were sequenced on average to a depth of 30x coverage, revealed considerably more virus positive cases than investigations of transcriptome data alone, which is the search space looked at in most previous virome studies. This is probably mainly due to viruses that are transcriptionally inactive in the given tumor tissue. Co-infections, generally believed to indicate a weak immune system, were very rare (Supplementary Figure 2C). This could, however, also be the result of selection processes during tumorigenesis.
Not surprisingly, known tumor associated viruses, such as EBV, HBV and HPV16/18, were among the most frequently detected targets. This is in particular true for the common integration verified for HBV and HPV 16/18 in our study. In addition, the common theme of potential pathomechanistic effects by the genomic integration, nurtured also by the observations of multiple nearby integration sites in a given tumor genome that we also report in the present study, has gained further momentum. Analyzing the effect of viral integrations on gene expression, we identified several links to genes nearby the integration site. In this regard, the frequently observed integration of HBV at the TERT promoter accompanied with the transcriptional upregulation of TERT, constitutes an intriguing example, since an increased activity of TERT is a well-understood driver of cancerogenesis 47 . Furthermore, we also linked viral integrations to increased mutations (SNVs and SCNAs) nearby the integration site.
The known causal role of HPV16/18 in several tumor entities, that triggered one of the largest measures in cancer prevention, has been the reason for extensive elucidation of the pathogenetic processes involved. Nevertheless, comprehensive analyses of WGS and RNAseq data sets revealed additional novel findings. While we confirmed the exclusivity of HPV infection and TP53 mutations in head and neck tumors, we could also link virus presence to an increase in mutations attributed to the mutational signatures 2 and 13 48 . These are explained by the activity of APOBEC, which -among other effects -changes viral genome sequences as a mechanism of cellular defense against viruses 49,50 . This activation could play an important role in introducing further host genome alterations and, thus, constitute an important mechanism driving tumorigenesis 25,51 . Furthermore, the virus positive head and neck cancer samples had a significantly higher abundance of T-cell and M1 macrophage expression signals, which matches with the recently described subtypes of HNSCC that differ -among others -in virus infection and inflammation features.

23
The novel finding of a tumor association of mastadenovirus needs to be further emphasized.
It was only detected in the whole genome sequencing data set, which might explain why it was not linked to cancer in previous large transcriptome-based studies. Virus positive cases mainly occurred in renal cancer especially of the chromophobe subtype. Interestingly, viral infections were highly enriched in the specimen without VHL driver mutations. The viral protein E1B is known to inactivate tumor suppressor p53 28,29 and this could be an alternative route to malignant transformation in these carcinomas. Furthermore, virus positive cases showed a significantly improved overall survival. This observation further supports a different path of tumor development and progression of these virus positive cases. Future research should decipher the link between the mastadenovirus infections and cancer in detail.