HiCancer: accurate and complete cancer genome phasing with Hi-C reads

Due to the high complexity of cancer genome, it is too difficult to generate complete cancer genome map which contains the sequence of every DNA molecule until now. Nevertheless, phasing each chromosome in cancer genome into two haplotypes according to germline mutations provides a suboptimal solution to understand cancer genome. However, phasing cancer genome is also a challenging problem, due to the limit in experimental and computational technologies. Hi-C data is widely used in phasing in recent years due to its long-range linkage information and provides an opportunity for solving the problem of phasing cancer genome. The existing Hi-C based phasing methods can not be applied to cancer genome directly, because the somatic mutations in cancer genome such as somatic SNPs, copy number variations and structural variations greatly reduce the correctness and completeness. Here, we propose a new Hi-C based pipeline for phasing cancer genome called HiCancer. HiCancer solves different kinds of somatic mutations and variations, and take advantage of allelic copy number imbalance and linkage disequilibrium to improve the correctness and completeness of phasing. According to our experiments in K562 and KBM-7 cell lines, HiCancer is able to generate very high-quality chromosome-level haplotypes for cancer genome with only Hi-C data.

Human genomes are diploid with two homologous sets of chromosomes. Each pair of homologous chromosomes share high similarity but are different in genetic variants such as single nucleotide polymorphism's (SNPs) and insertions/deletions. The sequences of the variants on single chromosomes are called haplotypes, which provide us with more information of genetic makeup in an individual genome. The reconstruction of haplotypes, called phasing, plays important roles in different areas of biology such as genome-wide association [1][2][3] and population genetics studies 4,5 and is critical for advancing precision medicine 6,7 . Due to the extensive somatic variations such as somatic SNPs, structural variations (SVs) and copy number variations (CNVs), cancer genome is much more complicated than normal human genome. Although the complete map of cancer genome should contain the sequence of every single DNA molecule, with the existing technologies, it is believed too difficult to distinguish the copies of chromosomes on the same allele (from the same parent) generated by CNVs according to somatic SNPs. Nevertheless, phasing each cancer chromosome into two haplotypes according to germline mutations as done for normal genome provides a suboptimal solution to understand cancer genome.
However, phasing cancer genome is also a challenging problem, due to the limit in experimental and computational technologies. As far as we know, among the widely used cancer cell lines, only K562 and Hela have comparatively high quality chromosome-level phased genome available which are constructed by integrating multiple types of data 8,9 . The existing technologies of phasing can be divided into three main categories. The trio-based methods such as TetraOrigin 10 and Merlin 11 generate the haplotypes which are most consistent with pedigree structure and Mendelian segregation. These methods work well when the parent-child relationship is available. The population-based methods such as Beagle 12 and SHAPTIT 13 estimate haplotypes of an individual genome through linkage disequilibrium measures learned from a population of unrelated known haplotypes. These methods can accurately infer haplotypes up to ~300 kb, but are not able to generate chromosome-level haplotypes 14 . The sequencing-baseed methods such as HapTree 15 and HapCompass 16 take advantage of sequenced long reads or paired-end short reads to link variants into haplotypes directly. Compared with the trio-based and population-based methods which use extra information to estimate haplotypes, the sequencing-based methods take advantage of the information (reads) directly obtained from the genome to be phased, and thus are able to generate more reliable haplotypes and have a broader range of applications. Nevertheless, due to the lack of long-range linkage information for linking distant

Methods
HiCancer is composed of four steps: pre-phasing processing, phasing, post-phasing processing and completing. In the pre-phasing processing, somatic SNPs are filtered, and the genome is divided into LOH regions and non-LOH regions. In the phasing step, non-LOH regions are phased into haplotypes. In the post-phasing processing step, a series of strategies are used to improve the completeness and correctness of haplotypes. In the completing step, sequences in LOH regions and phased haplotypes in non-LOH regions are assembled into the chromosome-level haplotypes. The pipeline of the proposed method is illustrated in Fig. 1.
Step1-2: pre-phasing processing and phasing. In the pre-phasing processing step, we filter somatic SNPs and detect LOH regions in the genome. The SNPs not appearing in the SNP list provided by 1000 Genomes Project Phase 3 19 are seen as somatic SNPs and removed. LOH regions are detected by comparing the density of heterozygous SNPs on the cancer genome with that on a normal genome in the same region. Specifically, the human genome is partitioned into segments of fixed length (typically 1 Mbp) and a statistical test is used on each segment to decide whether it is a LOH region on cancer genome. For each segment, we obtain the proportions of heterozygous and homozygous SNPs p normal hetero and p normal homo = 1 − p normal hetero from a normal cell line GM12878. Given the total number of SNPs N on cancer genome in the same region fixed, the number of heterozygous SNPs N cancer hetero is assumed to be binomial distributed with parameters N and p normal hetero . With one-tailed test carried out, this region will be recognized as LOH region if p-value is smaller than some threshold (typically 0.05), which means the proportion of heterozygous SNPs in this region on cancer genome significantly lower than that on normal genome. The SNPs in the recognized LOH regions are seen as false positive and removed before phasing.
Next, the Hi-C reads are aligned to the human reference genome with BWA mem 20 and the alignments uniquely aligned with MAPQ higher than some threshold (e.g, 30) are kept. Specifically, the paired-end reads are split into two groups, one for each pair, and the two groups are aligned separately and then combined by a Perl script "two_read_bam_combiner.pl" downloaded from https:// github. com/ dixon lab/ bwa_ mem_ hic_ align er. Then, the SNPs in each continuous non-LOH region are phased by HAPCUT2 with default parameters using Hi-C paired-end reads. According to our experiments, the phased haplotypes of each continuous non-LOH regions usually contain one or a few main haplotype fragments and a large number of tiny fragments with one or more SNPs (see Fig. 1D). And some haplotypes contain fatal switching errors (see Fig. 1D,E).
Step3: post-phasing processing. Due to the CNVs, many regions on cancer genome are aneuploid regions with different copy numbers on two alleles, leading to different read coverages between alleles after mapping Hi-C reads. According to our observations, when assembling two adjacent phased haplotype fragments, the haplotypes with similar read coverages are more likely on the same allele. In the post-phasing processing step, this regulation is used to improve the completeness and correctness of the phased haplotypes. First, the allelic read coverage imbalance are leveraged to correct the switching errors. Second, the allelic read coverage imbalance and linkage disequilibrium information are used to assemble the fragmented haplotypes and fill the gaps in haplotypes by adding the lost SNPs back into haplotypes. We describe these strategies in detail as follows.
Switching error correction. We search for the potential switching points on the haplotypes of each non-LOH www.nature.com/scientificreports/ represents the number of SNPs before this position whose read coverage on haplotype h 1 is higher than haplotype h 2 and the other three counts have similar meanings correspondingly. With these counts, two ratios r before = c can be obtained. We define a position as a potential switching point if the two ratios are significantly different. In detail, if one of these two ratios is larger than some threshold (e.g, 2) and at the same time the other one is smaller than some threshold (e.g, 0.5), this position is seen as a potential switching point. Since usually many qualified potential switching points exist, we pick the one with the lowest ratio of ratios and switch the haplotypes at it. This process is repeated until no potential switching point can be found.
Dynamic programming vectors are used to obtain counts for all positions efficiently. Let c before 1 [i] be the number of SNPs whose read coverage on haplotype h 1 higher than haplotype h 2 before ith SNP. First we initialize c before 1 [0] to be 0. The rest part of the dynamic programming vector can be filled using the following recurrence relation: where cov 1 [i] and cov 2 [i] represent read coverages of ith SNPs on haplotype h 1 and haplotype h 2 respectively. The other counts can be calculated in the similar way with the following recurrence relations: r r = min{r before , r after } max{r before , r after } c before 1 [i] = c before 1 www.nature.com/scientificreports/ Assembly of haplotype fragments and gap filling. We developed a graph model called coverage matching graph (CMG) to assemble main haplotype fragments and fill the gaps with tiny haplotype fragments at the same time.
A CMG G is an undirected weighted graph in which each vertex represents one of the two haplotypes of a single SNP. Given two SNPs s i and s j belonging to different fragments with haplotypes s a i , s b i and s a j , s b j respectively, we create four undirected edges (s (1) the genomic distance between s i and s j does not exceed some threshold (e.g, 1 Mbp); (2) s i is one of the n (e.g, 5) closest SNPs to s j and s j is same to s i ; (3) at least one of s i and s j belong to the the block exceeding a minimum number of SNPs (e.g, 100) and (4) the read coverage on each of (s a i , s b i ) , (s a j and s b j ) is higher than some threshold (e.g, 10). Each of these four edges is assigned a weight w(s x i , s y j ) which represents the likelihood of the two vertices connected belonging to the same haplotype. In detail, the calculation of edge weights is based on the probabilistic model below. Let's take w(s a i , s a j ) and w(s b i , s b j ) which represent the likelihood of haplotypes "aa|bb" for instance. Given the read coverages c(s a i ) and c(s b i ) of s a i , s b i , the read coverage proportions can be calculated as ) . In the case of "aa|bb", we assume the read coverage respectively. Since the Hi-C read coverage is usually biased by many factors such as GC-content, mappability and the density of restriction sites, these two terms are normalized by the total coverage c(s j ) of s j to transfer absolute coverages to proportions. Since the normalized terms p(s a j )log(p a i ) and p(s b j ) log(p b i ) are both negative, to reduce the complexity of computation in later steps, reciprocals of opposite numbers of them . On the other hand, by estimating the log-likelihood of "aa|bb" given the read coverage proportions p a j and p b j of s a j , s b j , the other components w j (s a i , s a j ) and w j (s b i , s b j ) can be calculated in the same way. Thus, the weights of edges (s a i , s a j ) and (s b i , s b j ) are the summation of two components as w(s a i , . With the same approach, the weights w(s a i , s b j ) and w(s b i , s a j ) of other two edges can be obtained by calculating the likelihood of the alternative haplotypes "ab|ba". We normalize these four weights to make sure their summation to be 1. We measure the reliability of the edges between s i and s j by comparing "aa|bb" supporting weight summation w(s a i , s a j ) . If the ratio of the smaller summation and larger summation exceeds some threshold (e.g, 0.4), these four edges are all removed from G. For two adjacent SNPs s i and s j belonging to the same block, we create two edges in G connecting the vertices known to be on the same alleles with infinite weights.
Once the CMG G is built, we assemble haplotypes and fill the gaps. In each connected component G i of G obtained by Breadth First Search (BFS) 21 , vertices are partitioned into two groups (two haplotypes) by cutting a subset of edges, satisfying that each pair of vertices corresponding to same SNP much be assigned to different groups. Since there could be many feasible solutions, according to maximum parsimony strategy, we pick the partition which cuts a subset of edges with minimum total weights. With this approach, the assembled haplotypes will not be conflict with the any of the old haplotype fragments, due to the infinite weight assignments to edges between vertices known to be in the same haplotype. This problem can be seen as a variant of the regular Minimum s-t cut problem in graph theory, and thus we call it Minimum Multiple s-t Cut problem. The formal definition is as follows.
Definition 1 (Minimum Multiple s-t Cut problem) Input: A weighted undirected graph G = (V , E) , with V be a set of n pairs of vertices. Output: A minimum cut S, that is, a partition of the nodes of G into S and V \ S such that (1) exactly one of each pair of vertices belongs to S, (2) the total weights of the edges going across the partition is the minimum among all the partitions of the nodes satisfying (1).
In Theorem 1 below, we show that the Minimum Multiple s-t Cut problem is NP-hard by reduction from Max-Cut problem with non-negative edge weights which is known to be NP-hard 22 . In Max-Cut problem, we are given a weighted undirected graph, and we need to find a cut whose total weight is maximum among all feasible cuts. [i] =c after 1 www.nature.com/scientificreports/ Proof Given an instance G ′ = (V ′ , E ′ ) of Max-Cut with non-negative edge weights, we build an instance Given the complexity of Min-Multi-Cut problem, we revise one of the most popular heuristic algorithm for regular Min-Cut problem called Karger's algorithm 23 to solve it. Instead of determining the minimum cut for all pairs of vertices at same time, the revised Karger's algorithm iteratively determine the cut for two pairs of vertices at each time. We decide the order of the two pairs picked by an association graph A built from original graph G. The association graph is an undirected graph in which each vertex represents a pair of vertices in G and an edge indicates there are at least one edge between the two pairs of vertices in G. For an edge (u, v) in A which connects two pairs of vertices u a , u b and v a , v b in G, the weight w A (u, v) is calculated as follow.
where and r 2 = 1 − r 1 are the proportions of weights supporting two feasible cut " u a , v a /u b , v b " and " u a , v b /u b , v a " in G respectively. If one or more of these four edges don't exist in G, the weights of missing edges are used as 0. With association graph A, our heuristic algorithm can be carried out by iteratively randomly picking an edge from A with probabilities proportional to the edges weights and determine the cut for the corresponding two pairs of vertices in G. After each iteration, the association graph is updated by removing this edge picked and merging the two vertices. Since the edge weights in A represent the reliability, this process is in accordance with the intuition that the two pairs whose cut can be determined more reliably should be picked earlier.
We repeat this process for M times and pick the best solution, where M is a parameter specified by user which represents a tradeoff between the goodness of solution and running speed.
In this algorithm, the time efficiency directly affects the goodness of solution. To speed up, instead of sampling edge and merging vertices at each time in association graph, we generate a random permutation of edges with probabilities proportional to edge weights at once. Then the edges are picked in order of permutation to generate the cut. With the use of disjoint set 21  for each (u, v) ∈ E A ordered by P do 10: if FIND-SET(u) = FIND-SET(v) then 11: determine the cut of u a , u b , v a and v b in G 12: merge this cut into global cut C 13: UNION(FIND-SET(u), FIND-SET(v)) 14: end if 15: end for 16: end while 17: if score(C) < score(best_C) then  www.nature.com/scientificreports/ Gap filling in balanced allelic copy number regions using linkage disequilibrium information. After assembling and gap filling haplotypes with coverage information, there are still some tiny haplotype fragments not merged into main haplotypes yet. These tiny fragments mostly belong to the genomic regions with balanced copy numbers on two alleles, so that the coverage information fails to merge them to main haplotypes. For these regions, we leverage the linkage disequilibrium information. The main haplotypes of each continuous non-LOH region are chosen as the cluster center, and the tiny fragments are assigned to the closest cluster according to the genomic distances. In each cluster, the main haplotypes are used as "seed haplotypes" to guide the phasing of the whole cluster by Beagle (v5.1) software 24 with linkage disequilibrium information learned from a population of haplotypes generated by 1000 Genomes Project.
Step4: completing. Finally, we complete the reconstruction of chromosome-level haplotypes by connecting the sequences of LOH regions with haplotypes in non-LOH regions. This process is not trivial because a chromosome may contain multiple continuous LOH and non-LOH regions, and for every pair of LOH and non-LOH regions, one of the two haplotypes of the non-LOH regions need to be decided at the same allele as the LOH sequence. Intuitively, among all the possibilities of the chromosome-level haplotypes, the one with the most supporting Hi-C paired-end reads should be chosen. For each chromosome, we use a graph model in which each vertex indicates a LOH region or a haplotype of a non-LOH region. Between every pair of LOH vertex and a haplotype vertex, there is an edge with weight representing the number of paired-end reads with one end on the LOH region and the other end on the SNPs of the haplotype. The graph needs to be partitioned into two subgraphs (as two chromosome-level haplotypes) by removing a subset of edges which satisfies (1) two haplotype vertices of the each non-LOH region belong to different subgraphs (2) the total weights of the removed edges are minimum among all the partitions satisfying (1). Since the graph is small (fewer than 10 vertices) in most cases, all feasible solutions are enumerated and the optimal solution is obtained.

Results
We tested the performance of HiCancer in K562 25 and KBM-7 26 which are two human immortalized chronic myelogenous leukemia (CML) cell lines. K562 was derived from a 53-year-old Caucasian female in 1970 and KBM-7 was from a 39-year-old man. Although the two cell lines were both from CML patients, their genomes are significantly different. According to the previous study, K562 genome is near-triploid and most genomic regions are non-LOH, while KBM-7 genome is near-haploid 25,26 . Experimental results in K562 cell line. We chose K562 cell line because it has the highest quality phased haplotype sequences and detected LOH regions available which can be used as "ground truth" among the widely studied cancer genomes 8 . Nevertheless, the "ground truth" K562 haplotypes still have at least two shortcomings. First, the haplotypes in LOH regions are generated in the same way as non-LOH regions, which is incorrect in our opinion. Second, the haplotypes lose a large number of input SNPs and only 15 chromosomes have highquality haplotypes availiable. However, we still believe the "ground truth" haplotypes of the 15 chromosomes are able to prove the high correctness of HiCancer if the HiCancer haplotypes in non-LOH regions processed by removing SNPs not existing in "ground truth" haplotypes are highly consistent with the "ground truth" haplotypes. We first tested the correctness and completeness of haplotypes in non-LOH regions. Then we verified the accuracy of LOH regions detected. Overall, the chromosome-level haplotypes generated by HiCancer are tested.
For convenience of comparison with "ground truth" haplotypes, the input SNPs were also downloaded from 8 . The in situ Hi-C reads in K562 were downloaded from 27 and reference genome (hg19) was downloaded from UCSC Genome Browser website (https:// genome. ucsc. edu/). When running HiCancer, all the built-in tools including BWA 20 , HAPCUT2 17 and Beagle (v5.1) 24 were all run with default parameters. The threshold of MAPQ for filtering alignments was set to be 30. All the other parameters were default (given in parenthesis in "Methods" section).

HiCancer can phase cancer chromosomes correctly and completely in K562 cell line.
In this section, we tested the performance of HiCancer in terms of completeness and correctness by comparing with HAPCUT2 and WhatsHap 28 which are two most widely-used phasing tools using paired-end short reads (although WhatsHap is not specially designed for Hi-C data). Since HAPCUT2 and WhatsHap do not deal with LOH regions and thus are not able to generate chromosome-level haplotypes for many chromosomes of K562, to be fair, we only focus on the non-LOH regions in comparison. Two measurements were used to evaluate the completeness of non-LOH haplotypes for each chromosome. The first one was the number of phased large haplotype blocks with at least 100 SNPs in non-LOH regions, and the second one was the proportion of non-LOH SNPs contained in these large haploytpe blocks in total. To evaluate the correctness of haplotypes, absolute error rate (AER) of the largest phased block of each chromosome was used. The calculation of AER is as follows. For one chromosome, let S be the intersection of SNPs in HiCancer haplotypes and "ground truth" haplotypes. www.nature.com/scientificreports/ In this experiment, HAPCUT2 and WhatsHap were run with default parameters. Table 1 shows that HiCancer and HAPCUT2 both are able to phase the non-LOH regions of the vast majority of chromosomes into one single fragment of haplotypes. On chromosome 9 and 14, they both provide two large fragment of phased haplotypes because these two chromosomes both contain very long LOH regions between non-LOH regions. Chromosome 3 and X do not need to be phased because the whole chromosomes are LOH regions (see Table 2 for details). Compared with HAPCUT2 which generates haplotypes with about 70-80% of SNPs, HiCancer is able to phase haplotypes with 100% or almost 100% SNPs. With the step of completing, HiCancer is able to generate a complete single pair of haplotypes for every chromosome. At the same time, HiCancer outperforms HAPCUT2 in correctness on 12 of the 15 chromosomes with "ground truth" haplotypes available. On the other 3 chromosomes, HiCancer gives comparable correctness. On chromosome 7, HAPCUT2 generates haplotypes with extremely high AER (46.8894%) caused by a switching error appearing at the middle of the chromosome. HiCancer successfully corrects the switching error and reduces the AER to 1.5327% which is at the same level as the AER of other chromosomes. Compared with HAPCUT2 and HiCancer, WhatsHap is only able to generate short haplotype fragments, which lead to the extremely low completeness in the whole genome and high accuracy in some chromosomes.
HiCancer can detect LOH regions accurately in K562 cell line. We also tested the accuracy of LOH regions detected by HiCancer. LOH regions detected by 10x genomics reads in 8 were used as "ground truth". The accuracy was evaluated by the precision and sensitivity calculated as follows. For each chromosome, we obtained the total length "length_call" of detected LOH regions , total length "length_truth" of LOH regions in "ground truth" and the length "length_overlap" of their overlaps. The precision is the ratio of "length_overlap" and "length_call" , and sensitivity is the ratio of "length_overlap" and "length_truth".
Observed from Table 2 that HiCancer detects LOH regions with high accuracy. On the 11 chromosomes with total length of LOH region higher than 4Mb in "ground truth", HiCancer obtains sensitivities from 94.4 to 100% and most of them are higher than 98%. On 10 of these 11 chromosomes, the precisions are from 70.2 to 83.0% and the precision of chromosome X is 50.9%. For the whole genome, the precision is 70.58% and sensitivity Table 1. Comparing the phasing performance of HiCancer with HAPCUT2 and WhatsHap on K562 genome. "Chrom" is the abbreviation of "Chromosome"; "HiCancer−" represents the output of HiCancer without doing the step of completing. Numbers in boldface highlight the best completeness and correctness. For the chromosomes whose whole or almost whole chromosomes are LOH regions, 'N/A' is used for some statistics cannot be calculated. www.nature.com/scientificreports/ is 98.22%. We further investigated the reason that the precisions are not as high as sensitivities. We found that on many chromosomes such as chromosome 3,9,13,14,22 and X, HiCancer treats almost the all regions as LOH while the "ground truth" only treats parts of the chromosomes as LOH regions. However, from Figure 1 of their paper 8 , it is easy to see that almost all regions of these chromosomes are short of heterozygous SNPs. We believe this is caused by the different definitions of LOH regions. Some regions with very small number of heterozygous SNPs were seen as non-LOH regions in 8 but detected by HiCancer as LOH regions because they should not be treated as two allele when building chromosome-level haplotypes.
Experimental results in KBM-7 cell line. In addition to K562 cell line, we also tested the performance of HiCancer in KBM-7 cell line. The Since KBM-7 cell line does not have high-quality phased haplotypes to verify the correctness, we only tested the completeness using the same criteria in K562 experiments. The in situ Hi-C reads in K562 were downloaded from 27 and reference genome (hg19) was downloaded from UCSC Genome Browser website (https:// genome. ucsc. edu/). BWA mem and GATK HaplotypeCaller (v4.1) were used to map Hi-C reads and call the SNPs respectively, both with default parameters. HiCancer, HapCut2 and WhatsHap were all run with default parameters. The default parameters of HiCancer are given in parenthesis in "Methods" section. Table 3 shows that, same as K562 cell line, HiCancer generates complete chromosome-level haplotypes with almost all SNPs in KBM-7 cell line. However, the results of HapCut2 and WhatsHap are different from those in K562 cell line in two aspects. First, the "% non-LOH SNPs in large blocks" values of HapCut2 fluctuate from 35.1531 to 99.0394% while the values are mostly around 80% in K562. Second, the "% non-LOH SNPs in large blocks" values of WhatsHap are significantly higher than those in K562, although they are still much lower than HapCut2 and HiCancer. We believe these differences result from the low proportion of non-LOH regions in KBM-7 genome.

Discussion
We presented HiCancer, a new computational pipeline for phasing cancer genome with Hi-C reads. We found that HiCancer is able to generate chromosome-level haploytypes for cancer genome with very high completeness and correctness using only Hi-C paired-end reads. Table 2. Performance of LOH detection of HiCancer on K562 genome. Numbers in boldface highlight the best completeness and accuracy. "Chrom" is the abbreviation of "Chromosome"; "Calls", "Ground truth" and "True positive" represent the total length of called LOH regions, total length of LOH regions in "ground truth" and their overlapped length respectively; "# LOH" and "# non-LOH" represent the numbers of continuous LOH and non-LOH regions respectively. www.nature.com/scientificreports/ There are a number of areas that our methods and approaches can be further improved. First, the current version of HiCancer needs SNPs called as input. Since the SNP list is not available for all cancer genome, it limits the application of HiCancer. It remains to be explored to improve HiCancer by adding a step of SNP calling using Hi-C reads. The difficulty is that the Hi-C read coverage is greatly affected by factors such as mappability, GC content and density of restriction sites which may reduce the accuracy of called SNPs. But we believe the appropriate normalization will be able to solve this problem. Second, HiCancer can only phase cancer genome using SNPs as markers for now, but the future work needs to be done to add the function of dealing with small insertions and deletions. Third, HiCancer only generates two haplotypes for each chromosome now, but future work is required to further distinguish multiple copies of each haplotype according to somatic mutations. Due to the complexity of cancer genome, this could be too challenging using only Hi-C reads. However, by combining Hi-C reads with the newest third-generation sequencing long reads such as Pacbio HiFi reads (high-fidelity long reads) and Oxford Nanopore ultra-long reads, we believe there is a chance to solve this problem and generate complete cancer genome map. Table 3. Comparing the phasing performance of HiCancer with HAPCUT2 and WhatsHap on KBM7 genome. "Chrom" is the abbreviation of "Chromosome"; "HiCancer−" represents the output of HiCancer without doing the step of completing. Numbers in boldface highlight the best completeness.  www.nature.com/scientificreports/