Long-read sequencing reveals the structural complexity of genomic integration of HBV DNA in hepatocellular carcinoma

The integration of HBV DNA into the human genome can disrupt its structure in hepatocellular carcinoma (HCC), but the complexity of HBV genomic integration remains elusive. Here we applied long-read sequencing to precisely elucidate the HBV integration pattern in the human hepatocellular genome. The DNA library was sequenced by long-read sequencing on the GridION and PacBio Sequel II platforms. DNA and mRNA were also sequenced by next-generation sequencing on Illumina NextSeq. BLAST (Basic Local Alignment Search Tool) and local scripts were used to analyze HBV integration patterns. We established an analytical strategy based on the long-read sequences and analyzed the complexity of HBV DNA integration into the hepatocellular genome. A total of 88 integration breakpoints were identified. HBV DNA was integrated into human genomic DNA mainly as fragments in different orientations, and rarely as a complete genome. The same HBV integration breakpoints were identified across the three platforms. In the HBV genome, most breakpoints were located in the P, X, and S genes; in the human genome, they were located in introns, intergenic sequences, and exons. Tumor tissue harbored many more integrations than the adjacent tissue, and the integrations were more concentrated across the human chromosomes. HBV integration therefore shows different patterns in cancer cells and adjacent normal cells. For the first time, we obtained the entire HBV integration pattern through long-read sequencing and demonstrated the value of long-read sequencing in detecting the genomic integration structures of viruses in host cells.


Software and code
Data collection
The DNA library was sequenced by long-read sequencing on the GridION and PacBio Sequel II platforms. DNA and mRNA were sequenced by next-generation sequencing on Illumina NextSeq.

Data analysis
HBV integration analysis
FastQC (https://github.com/s-andrews/FastQC) was used for quality control of sequences. FLASH (https://ccb.jhu.edu/software/FLASH) was used to merge paired-end reads from the next-generation sequencing experiments. Seqkit2 (https://bioinf.shenwei.me/seqkit) was used to convert FASTQ files to FASTA files. We detected HBV integration breakpoints in the data from all three platforms using BLAST (ref. 30) and local scripts. The workflow is shown in Figure 1. The raw data were first mapped to the HBV genotype C genome (AB981580.1) using BLAST. Reads mapped to the HBV genome were retained using local scripts and then mapped to the human genome (GRCh38). For each sequence, we further filtered the files by selecting all viral HSPs (high-scoring segment pairs) and the first three human HSPs. We visually inspected these files to identify sequences containing human-virus-human or human-virus connections. Chimeric reads (reads that aligned partially to the human genome and partially to the HBV genome) were retained, and the complex integrated genome structure was analyzed. HBV integration breakpoints were annotated using HOMER (https://anaconda.org/bioconda/homer). To increase the length of the HBV integration sequences identified on the Illumina platform, we merged the paired-end reads and analyzed the merged (extended) reads and the unmerged reads separately.

To verify the reliability of this workflow in detecting chimeric reads of HBV integration, we also ran a second pipeline and compared the consistency of the two methods. We used KMC software to break the HBV genome and the sequence data into 31-bp k-mers. Local scripts were used to obtain the intersection of the two k-mer sets.
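The k-mer cross-check described above can be illustrated with a minimal sketch. This is not the authors' pipeline: it replaces KMC with plain in-memory Python sets, which only suits toy inputs, and the function names (`kmers`, `revcomp`, `reads_sharing_kmers`) are hypothetical. Matching the reverse complement of the viral genome is also an assumption, made so that reads from either strand are found.

```python
# Toy stand-in for the k-mer intersection step (KMC is NOT used here).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq, k=31):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reads_sharing_kmers(reads, viral_genome, k=31):
    """Return names of reads that share at least one k-mer with the virus.

    reads: dict mapping read name -> sequence.
    Both strands of the viral genome are indexed (an assumption of this sketch).
    """
    viral = kmers(viral_genome, k) | kmers(revcomp(viral_genome), k)
    return [name for name, seq in reads.items() if kmers(seq, k) & viral]


if __name__ == "__main__":
    # Made-up sequences; k is reduced to 5 so the example stays readable.
    toy_virus = "ACGTACGTTGCA"
    toy_reads = {
        "r1": "TTTTACGTACGGGGG",  # carries forward-strand viral k-mers
        "r2": "CCCCCCCCCCCC",     # no viral content
        "r3": "AAATGCAACAAA",     # carries reverse-strand viral k-mers
    }
    print(reads_sharing_kmers(toy_reads, toy_virus, k=5))  # prints ['r1', 'r3']
```

Reads returned by such a filter are only candidates. They still need alignment to both the viral and the human genome before they can be called chimeric.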
Reads containing the intersecting k-mers were then extracted. In the Illumina data, HBV breakpoints supported by at least two reads were regarded as high-confidence HBV integration sites in the subsequent analysis.

Gene expression analysis
After mapping the reads to the human genome (GRCh37, hg19) using STAR (https://github.com/alexdobin/STAR), a normalized gene expression matrix was obtained with HTSeq (https://pypi.org/project/HTSeq) and StringTie (ccb.jhu.edu/software/stringtie/). NOISeq (http://bioinfo.cipf.es/noiseq) was used for differential expression analysis of the samples, which had no replicates. Differentially expressed genes were defined by fold change > 2 and adjusted P-value < 0.05. The "clusterProfiler" R package (ref. 29) was used for Gene Ontology enrichment analysis of the genes in which HBV breakpoints were located, or the genes nearest to them. Enriched Gene Ontology terms with adjusted P-value < 0.05 were considered statistically significant. STAR-Fusion (http://star-fusion.github.io/) was used to identify fusion transcripts from the RNA sequencing data.
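The differential expression threshold stated above (fold change > 2 and adjusted P-value < 0.05) can be written as a small filter. The sketch below is illustrative only: it does not call NOISeq, the gene names and values are made up, and the record type `GeneResult` and function `is_differential` are hypothetical. Treating a fold change below 1/2 as down-regulation is an assumption, since the text gives only "fold change > 2".

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    gene: str
    fold_change: float  # expression ratio between the two conditions, > 0
    adj_p: float        # adjusted P-value


def is_differential(r, fc_cutoff=2.0, p_cutoff=0.05, symmetric=True):
    """Apply the stated thresholds to one gene.

    With symmetric=True (an assumption of this sketch), ratios below
    1/fc_cutoff are also accepted as down-regulated genes.
    """
    if r.adj_p >= p_cutoff:
        return False
    if r.fold_change > fc_cutoff:
        return True
    return symmetric and 0 < r.fold_change < 1.0 / fc_cutoff


if __name__ == "__main__":
    # Made-up results for four placeholder genes.
    results = [
        GeneResult("geneA", 3.1, 0.01),   # up-regulated, significant
        GeneResult("geneB", 1.4, 0.001),  # significant, but change too small
        GeneResult("geneC", 0.3, 0.02),   # down-regulated, significant
        GeneResult("geneD", 5.0, 0.2),    # large change, not significant
    ]
    print([r.gene for r in results if is_differential(r)])  # prints ['geneA', 'geneC']
```

Setting `symmetric=False` restricts the filter to the literal "fold change > 2" reading, which keeps only up-regulated genes.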

Data
The sequencing data has been submitted to the Genome Sequence Archive and is available under the accession number HRA001037.
