Vy-PER: eliminating false positive detection of virus integration events in next generation sequencing data

Several pathogenic viruses such as hepatitis B and human immunodeficiency viruses may integrate into the host genome. These virus/host integrations are detectable using paired-end next generation sequencing. However, the low number of expected true virus integrations may be difficult to distinguish from the noise of many false positive candidates. Here, we propose a novel filtering approach that increases specificity without compromising sensitivity for virus/host chimera detection. Our detection pipeline termed Vy-PER (Virus integration detection bY Paired End Reads) outperforms existing similar tools in speed and accuracy. We analysed whole genome data from childhood acute lymphoblastic leukemia (ALL), which is characterised by genomic rearrangements and usually associated with radiation exposure. This analysis was motivated by the recently reported virus integrations at genomic rearrangement sites and association with chromosomal instability in liver cancer. However, as expected, our analysis of 20 tumour and matched germline genomes from ALL patients finds no significant evidence for integrations by known viruses. Nevertheless, our method eliminates 12,800 false positives per genome (80× coverage) and only our method detects singleton human-phiX174-chimeras caused by optical errors of the Illumina HiSeq platform. This high accuracy is useful for detecting low virus integration levels as well as non-integrated viruses.


Supplementary consortium data
The paired tumour and remission genomes of the acute lymphoblastic leukemia patients were sequenced for the UFO sequencing consortium within the I-BFM Study Group. The sequencing consortium is currently coordinated by Prof Dr Andre Franke and Prof Dr Martin Stanulla.
The consortium comprises the following members (in alphabetical order): • Department of Oncology, University Children's Hospital Zurich, Zurich, Switzerland.
• Department of Pediatric Oncology, Hematology and Clinical Immunology, Heinrich Heine University, Düsseldorf, Germany.
• Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Germany. The paired-end reads are aligned to the host genome using BWA (aln -n 2), followed by BWA's sampe command. The SAM file must be retained for our Vy-PER pipeline, and also converted into sorted BAM format using SAMtools.

Extraction of partially unmapped read--pairs.
The unmapped end of a paired-end fragment which partly aligned to hg19 is extracted from the BAM file using SAMtools (view -f 4 -F 264). The resulting SAM file is converted to FASTA format using the Vy-PER script vyper_sam2fas_se.

Filtering of reads with less than 30 bp non--STR region.
Phobos 3.3.12 is used for exact STR-typing of each read, and vyper_sam2fas_se is used to remove reads in which the main STR leaves less than 30 bp of remaining read. This remaining read may be an STR, but the read is nevertheless retained at this stage, as it is difficult to rule out that a unique alignment to a virus genome will not occur. The threshold of 30 bp can be changed by the user, but was chosen as the default, because 30 bp single-end reads align unambiguously to about 80% of the human genome 45 . The reads that are left after this filtering are converted to FASTA format using the Vy-PER script vyper_sam2fas_se.

Alignment to all NCBI virus genomes.
The extracted reads are mapped to the virus genome FASTA file with BLAT (-out=blast8 -noTrimA -t=dna -q=dna -maxGap=0 -fastMap) and the results are summarised in a table of virus candidates per genomic integration site using Vy-PER_blatsam. Only the top 3 virus candidates per integration site are retained. Vy-PER_blatsam also generates a folder of FASTA files that can be used with other alignment tools for optional manual inspection of each virus candidate. The tool BLAT is employed because BWA is not a sensitive aligner, even with the setting "BWA -n 5". In our tests, BLAT was in some cases more sensitive than BLAST, and when we tested their speed, the same 1000 sequences were aligned 7× faster by BLAT.

Final filtering of false positive virus candidates.
The Vy-PER script Vy-PER_final_filtering removes virus candidate sequences consisting mainly of an STR (default threshold: 50% of virus candidate sequence length), which stem from a more complex read that was not removed in step 3. The remaining sequences are aligned to the host genome reference sequence window expected for both paired-end reads (window size = 5 × read length), using BLAT with sensitive settings (-out=blast8 -noTrimA -t=dna -q=dna -repMatch=1000000 -fine -tileSize=11 -stepSize=5). These settings make BLAT too slow for aligning low-complexity reads to the entire human genome. For instance, we aborted the BLAT-alignment of 770 such reads after 17.5 hours. For this reason Vy-PER_final_filtering outputs a FASTA file of virus candidates that can be filtered further using the FPGA-based Smith-Waterman aligner on the RIVYERA architecture. Furthermore, Vy-PER_final_filtering clusters virus candidates into genomic windows for graphical and tabular summary statistics. The default genomic window size is 1k bps, the minimal number of virus/host chimeras within the cluster is by default 10, and the minimal number of chimeras from the same virus within the cluster is by default 10. To increase sensitivity, the minimal number of chimeras can be reduced to 1. To reduce the number of genomic windows, their size can be increased without limit, e.g. to 1M bps.

Optional Smith--Waterman alignment to host genome.
The virus candidates in the FASTA file are aligned to hg19 with the FPGA-based Smith-Waterman aligner using the scoring matrix NUC44 (match: 5, mismatch: -4, gap open: -10, gap extension: -1) and requesting only one single best alignment. Vy-PER_final_filtering reads the Smith-Waterman aligner's out_detail.txt file and removes virus candidate sequences which aligned well to the host genome. Due to the strongly varying virus candidate sequence lengths, the alignment score does not allow a direct conclusion on whether a significant stretch of the read (default: 90% of the read length) was mapped (indel-free but permitting mismatches) to the host genome. Instead, we evaluated the CIGAR string of the alignment.

Results files.
The Vy-PER pipeline generates a) an ideogram plot in PDF format giving a summary of candidate loci and virus types (see main manuscript, Figure 1)

True human sequence removed by DUST filtering in SURPI and BLAST
SURPI reported a file with unmapped reads for our test data set NA12878V. One of the first reads that was not mapped to the human genome or to any other genomes is: @SRR010940.13891198/1 GACAGTTTGTTTGTTTGTTTGTTTGTTTGTTTGTTTTTTGAGATAGAGTCTTGCTCTGTTGCCC BLAT correctly maps this read to human chromosome 22: BLAST incorrectly does not map this read at all: Explanation: SURPI uses DUST to eliminate low complexity reads. BLAST also uses DUST and eliminated this real human read that was sequenced in the 1000 Genomes Project from the individual NA12878. Note that this read could be unambigously mapped to the human chromosome 22 by BWA.

Microhomologies: Three identical stretches between human and virus references in the 268T patient sequence at HBV integration locus chr19:30297359
Using a visual inspection tool and Dialign-based alignment, we aligned the 101 bp read that BLAT mapped to HBV to the expected 1001 bp window of hg19 (upper row of sequences).
The aligned sequences are: (a) the full human patient sequence that was partially mapped to HBV by BLAT (middle row with blue highlighted cell). The sequence stretch mapping to HBV ranges from column 857-910.
(b) the sequence stretch that was not mapped to HBV by BLAT (bottom row, columns 836-856). This stretch aligns perfectly to hg19, but the four next nucleotides (GCTA, columns 857-860) also map perfectly to hg19.
To clarify, a sequence stretch of four nucleotides is identical in the HBV part of the read, and in the hg19 reference.
It can be seen that two further stretches (columns 894-899, and columns 905-910) are identical between the HBV reference and the hg19 reference.
The four shared nucleotides in columns 857-860 appear to be consistent with random HBV integration into double strand breaks by the nonhomologous end joining (NHEJ) pathway, requiring microhomologies. The other end of the HBV integration shows a microhomology of five nucleotides (see page 8, fusion read 2 at chr19:30298787). The computed HBV integration on chromosome 5 even shows a shared stretch of 13 nucleotides (see page 9, Fusion reads 3 and 7 (identical) at chr5:1292392), but no matching "other end" for this HBV integration was detected.
We are not able to explain the other two identical stretches (columns 894-899 and 905-910), and they may just be a coincidence, although complex or repeated re-arrangements have been reported previously for HBV integrations.