Human papilloma virus (HPV) integration signature in Cervical Cancer: identification of MACROD2 gene as HPV hot spot integration site

Background: Cervical cancer (CC) remains a leading cause of gynaecological cancer-related mortality, with infection by human papillomavirus (HPV) being the most important risk factor. We analysed the association between different viral integration signatures, clinical parameters and outcome in pre-treated CCs.
Methods: Integration signatures were identified using HPV double capture followed by next-generation sequencing (NGS) in 272 CC patients from the BioRAIDs study [NCT02428842]. Correlations between HPV integration signatures and clinical, biological and molecular features were assessed.
Results: Episomal HPV was much less frequent in CC than in anal carcinoma (p < 0.0001). We identified >300 different HPV-chromosomal junctions (inter- or intra-genic). The most frequent integration site in CC was in the MACROD2 gene, followed by MIPOL1/TTC6 and TP63. HPV integration signatures were not associated with histological subtype, FIGO staging, treatment or PFS. HPV was more frequently episomal in PIK3CA-mutated tumours (p = 0.023). Viral integration type depended on HPV genotype (p < 0.0001), with HPV18 and HPV45 always being integrated. High HPV copy number was associated with longer PFS (p = 0.011).
Conclusions: To our knowledge, this is the first study assessing the prognostic value of HPV integration in a prospectively annotated CC cohort. It identifies a hotspot of HPV integration in MACROD2, a gene involved in impaired PARP1 activity and chromosome instability.


Background
Human papillomavirus (HPV) infection remains the leading cause of cervical cancer, one of the most common cancers in women worldwide. HPV infection is also involved in several other cancer types, including penile, vaginal, anal, and head and neck cancers. In the clinic, persistent infection with high-risk HPV genotypes is associated with cancer progression and response to treatment. Being able to genotype HPVs in cancer samples, and to characterise the insertion sites and the underlying mechanisms, is therefore of high interest for clinical research and diagnosis.
With the emergence of next-generation sequencing (NGS) techniques, new approaches have been developed to address these questions. As with any sequencing-based assay, these approaches require appropriate bioinformatics methods and tools to check sample quality and to process and analyse the sequencing data.
In addition, the portability, scalability and reproducibility of bioinformatics pipelines are well-known problems within the scientific community. Most of the time, bioinformatics analysis pipelines are designed for a single project, and therefore often require local customisation or modification of hard-coded parts of the source code, which inevitably leads to reproducibility issues. The development of FAIR (Findable, Accessible, Interoperable and Reusable) methods and tools is therefore a priority to address these issues.
Here, we present a new bioinformatics pipeline for virus insertion detection, based on our previous experience and methodology in NGS-based HPV genotyping [1]. The pipeline relies on a dedicated workflow which briefly i) checks the sample quality, ii) detects the virus genotypes, and iii) infers the precise insertion loci using a local mapping strategy. The pipeline is built following good coding practices and uses the Nextflow workflow management system, thus ensuring high portability, reproducibility and scalability.

Workflow description
Sequencing reads are first trimmed with the TrimGalore software to remove any adapter sequences at the 3' end of the reads, as well as low-quality bases (--quality 20) and short sequences (--length 20). While this step is optional, it is highly recommended to avoid noise in the downstream analysis and in the detection of soft-clipped reads.
FastQC is then used to assess the overall sequencing quality of the trimmed reads. It reports the quality score distribution across the reads, the per-base sequence content (%A/T/G/C), any remaining adapter contamination, and other overrepresented sequences.
Once the data are cleaned, the Bowtie2 software [2] is used to: 1) align all trimmed reads end-to-end against three control genes (KLK3, GAPDH, RAB7A) in order to estimate the viral load; 2) align all trimmed reads end-to-end against a list of virus reference genomes; 3) align all trimmed reads with a local mapping algorithm against the major virus strains previously detected (at most three), in order to precisely define the virus insertion sites.
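As an illustration of the control-gene normalisation described above, the viral load can be estimated as the ratio of reads aligned to the detected virus genome over reads aligned to the three human control genes. The following Python sketch is only illustrative (the function name and the exact normalisation are our assumptions, not the pipeline's actual code):

```python
# Illustrative viral-load estimate: reads on the virus genome are
# normalised by reads on the human control genes (KLK3, GAPDH, RAB7A).

def viral_load(virus_reads: int, control_reads: dict) -> float:
    """Return virus reads per control-gene read (simple ratio)."""
    total_ctrl = sum(control_reads.values())
    if total_ctrl == 0:
        raise ValueError("no reads on control genes; sample quality too low")
    return virus_reads / total_ctrl

# Example with made-up read counts:
ctrl = {"KLK3": 1200, "GAPDH": 1500, "RAB7A": 1300}
print(viral_load(8000, ctrl))  # 8000 / 4000 = 2.0
```

In practice the pipeline derives such counts directly from the Bowtie2 alignment statistics; the ratio simply corrects the virus read count for differences in sequencing depth between samples.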
Once the data are aligned, an in-house Python script is used to extract: 1) the breakpoint positions in the virus genome, using 5'/3' soft-clipped reads; 2) the corresponding human sequences in these human/virus chimeric reads. Finally, the Blat software is used to perform a high-accuracy alignment (>=90% similarity over >=25 bases) of these extracted human sequences, yielding precise insertion-site positions on the human genome. From these results, the detected breakpoints are prioritised based on i) the mappability at the locus, ii) the number of supporting reads, and iii) the presence of both 5' and 3' breakpoints on the same chromosome associated with the same virus strain.
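The soft-clip logic above can be sketched as follows. In a SAM record, a soft clip appears as an 'S' operation at either end of the CIGAR string: the clipped bases are the candidate human sequence and the viral coordinate where clipping starts is the putative breakpoint. This Python sketch is a simplified illustration under our own assumptions (function name, minimum clip length), not the pipeline's actual script:

```python
import re

# Parse CIGAR operations, e.g. "50M30S" -> [('50','M'), ('30','S')]
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def find_breakpoint(pos: int, cigar: str, seq: str, min_clip: int = 25):
    """Return (side, breakpoint_position, clipped_sequence) or None.

    pos is the 1-based leftmost aligned position on the virus genome.
    """
    ops = CIGAR_RE.findall(cigar)
    n, op = int(ops[0][0]), ops[0][1]
    if op == "S" and n >= min_clip:
        # 5' soft clip: the junction is at the start of the alignment
        return ("5p", pos, seq[:n])
    n, op = int(ops[-1][0]), ops[-1][1]
    if op == "S" and n >= min_clip:
        # 3' soft clip: the junction is after the last aligned reference base
        ref_len = sum(int(l) for l, o in ops if o in "MDN=X")
        return ("3p", pos + ref_len - 1, seq[-n:])
    return None

# A 3'-clipped read: 50 aligned (viral) bases then 30 clipped (human) bases.
hit = find_breakpoint(101, "50M30S", "A" * 50 + "C" * 30)
print(hit)  # breakpoint at viral position 150, clipped human sequence 'C' * 30
```

The clipped sequences returned this way are the ones subsequently realigned to the human genome with Blat.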
All results are filtered and gathered in a user-friendly report generated with the MultiQC software [3]. The report can be produced for all samples together or for each sample individually. Briefly, it offers a synthetic view of the data quality, alignment results and detected virus genotypes. For each genotype, the read coverage and the position of each detected breakpoint, with its number of supporting reads, are displayed in a dynamic view of the viral genome and its gene annotations.

Implementation
This pipeline is based on the Nextflow workflow management system [4]. Like other workflow management systems, Nextflow separates the computing infrastructure requirements from the analysis workflow itself, and therefore natively offers a high degree of portability and reproducibility. For example, most High Performance Computing schedulers, such as SLURM, LSF, PBS and Torque, are natively supported by Nextflow. It also has built-in support for deployment through Conda environments and container technologies such as Docker and Singularity. In addition, the pipeline was written following the best coding practices promoted by the nf-core community, a community of almost 90 developers representing almost 30 organisations that provides state-of-the-art bioinformatics analysis pipelines and proposes strict guidelines for their development and for the usage of the Nextflow management system.
The nf-VIF pipeline therefore inherits these features and provides support for Conda or Singularity. In practice, this means that end-users do not need to manage the pipeline dependencies, which are already provided as containers and/or recipes to create them. The pipeline can be run on a local infrastructure such as a laptop, or on a computing cluster to process all samples in parallel. For each step of the pipeline, the required resources in terms of CPUs, memory (RAM) and time can be defined. All these points are key features of reproducible research, and are especially important for tools dedicated to the analysis of clinical and/or diagnostic data. Thus, nf-VIF provides a high level of reproducibility and traceability.
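Per-step resources of this kind can typically be set in a Nextflow configuration file. The fragment below is only an illustration of the mechanism (the process name is hypothetical, not necessarily one defined in nf-VIF):

```groovy
// Illustrative Nextflow configuration fragment: run on a SLURM cluster and
// tune the resources of one (hypothetical) alignment step.
process {
  executor = 'slurm'
  withName: 'bowtie2_virus' {
    cpus   = 4
    memory = '16 GB'
    time   = '4h'
  }
}
```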

Usage example
The pipeline comes with a complete manual available online describing all pipeline options and outputs, as well as a test dataset to ensure that the pipeline is properly set up. By default, nf-VIF can be used in a very straightforward way by simply specifying a path to the input sequencing data or a text file listing all samples to analyse (sample plan). Note that there is no restriction on the number of samples: all analyses are managed through a queue system according to the resources available on the infrastructure.

nextflow run main.nf --reads '*_R{1,2}.fastq.gz' -profile cluster,conda

Specifying the options -profile cluster,conda will automatically build a Conda environment and run the analysis on the cluster architecture. For more details about the available profiles, see the pipeline manual.
Importantly, nf-VIF is not restricted to a specific sequencing kit or capture protocol: all options, control regions and virus references can be updated by the end-users on the command line, for instance to customise the list of reference viruses and control sequences.

Availability
The pipeline is freely available at https://github.com/bioinfo-pf-curie/nf-vif/ under the CeCILL licence. More information is available on the website.