Dense and accurate whole-chromosome haplotyping of individual genomes

The diploid nature of the human genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. This lack of haplotype-level analyses can be explained by a lack of methods that can produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global, but sparse haplotypes obtained from strand-specific single-cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. We provide comprehensive guidance on the required sequencing depths and reliably assign more than 95% of alleles (NA12878) to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different technologies represents an attractive solution to chart the genetic variation of diploid genomes.


Supplementary Fig. 3: Quality measures used to evaluate predicted haplotypes.
A hypothetical phasing of 10 single nucleotide variants (SNVs) along a defined chromosomal region is shown. Each heterozygous SNV is represented in its two allelic forms (0: reference allele, 1: alternative allele).
True (reference) haplotypes are shown in blue and predicted haplotypes in red. (a) To count the number of switch errors (black crosses) between the true and predicted haplotypes, neighbouring pairs of SNVs are compared along each haplotype and recorded as a new binary string of 0's and 1's depending on whether the allele state changes (see gray box). A value of 0 is assigned if the two SNVs in a pair have the same value; otherwise a value of 1 is assigned. The number of positions at which the binary strings generated for the true and predicted haplotypes differ is reported as the total number of switch errors. (b) To calculate the Hamming distance, the number of SNV positions at which the reference and predicted haplotypes differ is counted. In addition, we calculate the block-wise Hamming distance, which is the cumulative sum of the Hamming distances across all phased segments (see Supplementary Fig. 5).
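The quality measures described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the study: the function names and the example haplotypes are hypothetical, haplotypes are assumed to be equal-length lists of 0/1 alleles whose labels are already matched between truth and prediction, and segment length is assumed to be measured in number of SNVs.

```python
def transitions(hap):
    """Binary string: 1 where the allele changes between neighbouring SNVs, else 0."""
    return [int(a != b) for a, b in zip(hap, hap[1:])]


def switch_errors(true_hap, pred_hap):
    """Number of positions where the transition strings of the two haplotypes differ."""
    return sum(t != p for t, p in zip(transitions(true_hap), transitions(pred_hap)))


def hamming_distance(true_hap, pred_hap):
    """Number of SNV positions where the two haplotypes carry different alleles."""
    return sum(t != p for t, p in zip(true_hap, pred_hap))


def blockwise_hamming(blocks):
    """Cumulative Hamming distance over (true_segment, predicted_segment) pairs."""
    return sum(hamming_distance(t, p) for t, p in blocks)


def blockwise_hamming_rate(blocks):
    """Block-wise Hamming distance divided by the total segment length (in SNVs, an assumption)."""
    total_len = sum(len(t) for t, _ in blocks)
    return blockwise_hamming(blocks) / total_len


# Hypothetical example: SNVs 6-9 of the prediction sit on the wrong haplotype.
true_hap = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
pred_hap = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0]
blocks = [(true_hap[:5], pred_hap[:5]), (true_hap[5:], pred_hap[5:])]

print(switch_errors(true_hap, pred_hap))     # 2 (switch into and out of the wrong haplotype)
print(hamming_distance(true_hap, pred_hap))  # 4
print(blockwise_hamming_rate(blocks))        # 0.4
```

The example shows why the two measures complement each other: a single mis-phased stretch costs only two switch errors but contributes one Hamming error per SNV it contains.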

Supplementary Fig. 4: Coverage summary for various numbers of Strand-seq libraries
Plots show the depth of coverage, genome coverage and number of reads when using different numbers of Strand-seq libraries (x axis). For each library count we performed 5 randomized selections of libraries; the error bars reflect the variation among these selections. Depth of coverage is calculated as the overall number of bases sequenced per genomic position (excluding gaps ("N") in the genome). Genome coverage is calculated as the percentage of genomic positions (excluding gaps in the genome) covered by at least one read. Duplicate reads and reads with mapping quality <

Supplementary Fig. 5: Comparison of block-wise Hamming distances
Each sequencing technology is combined with various numbers of Strand-seq cells and the block-wise Hamming error rate is calculated for each combination as the sum of all Hamming distances across all phased haplotype segments divided by the total length of these segments.

Supplementary Fig. 6: Indel phasing performance when combining Strand-seq and PacBio.
Plot shows the performance of indel phasing on Chromosome 1 when combining various numbers of Strand-seq cells (5, 10, 20, 40, 60, 80, 100, 120, 134) with selected coverage depths of PacBio sequencing data (2, 3, 4, 5, 10, 15, 25, 30, >30-fold). Left: percentage of indels phased as part of the largest block. Right: extra switch errors per extra phase connection in the largest block. That is, we count the number of additional switch errors compared to phasing SNVs only and divide by the number of extra phase connections (the difference between the number of phase connections in the largest block when phasing SNVs and indels and the number when phasing SNVs only). A "phase connection" is defined as a pair of phased heterozygous variants that are consecutive in their phased block; the number of phase connections in the largest segment is hence equal to the number of heterozygous variants in the largest segment minus one.
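The ratio shown in the right panel can be written out as a short calculation. The following Python sketch only restates the definition given in the caption; the function names and the numbers in the example are hypothetical and are not results from this study.

```python
def phase_connections(n_het_variants):
    """Pairs of consecutive phased heterozygous variants in a block: n - 1."""
    return max(n_het_variants - 1, 0)


def extra_switch_errors_per_connection(switch_snv_only, n_snv_only,
                                       switch_snv_indel, n_snv_indel):
    """Additional switch errors divided by additional phase connections.

    switch_*: switch errors in the largest block for each phasing run.
    n_*:      heterozygous variants in the largest block for each phasing run.
    Returns None when adding indels created no extra phase connections.
    """
    extra_connections = phase_connections(n_snv_indel) - phase_connections(n_snv_only)
    if extra_connections <= 0:
        return None
    return (switch_snv_indel - switch_snv_only) / extra_connections


# Hypothetical numbers: 100 indels join a block of 1,001 SNVs and add 3 switch errors.
print(extra_switch_errors_per_connection(10, 1001, 13, 1101))  # 0.03
```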

Supplementary Note 3: Trio-aware read-based phasing
To perform trio-aware read-based phasing, we used genotype data for the parents (NA12891, NA12892) and the child (NA12878) in conjunction with PacBio reads from the child to perform phasing in the PedMEC model (Minimum Error Correction on Pedigrees) 1. To obtain genotypes for all three family members, we ran FreeBayes (v1.0.2) on all three samples 2 to regenotype all SNVs reported for NA12878 in the Platinum Genomes data set (using options "--haplotype-basis-alleles" and "-@"). The resulting genotypes were filtered for those with a quality of 30 or above. We then used WhatsHap in pedigree mode (option --ped), providing it with the family genotypes and different coverages of PacBio reads for the child (0x, 2x, 3x, 4x, 5x, 10x, 15x, 30x, all). Note that coverage 0x corresponds to pure genetic haplotyping, which relies solely on the genotypes. Pure genetic haplotyping can only phase variants that are not heterozygous in all individuals (i.e. homozygous in at least one), which led to 83.0% of all variants being phased. Supplementary Fig. 7 shows a comparison of the obtained haplotypes to the Platinum Genomes phasing (left, red) and to the haplotypes resulting from combining Strand-seq (all 134 cells) with PacBio (full coverage) in single-individual mode (as discussed in the main text). By increasing the PacBio coverage from 0x to 10x, we are able to increase the completeness from 83% to 96% of heterozygous SNVs phased. Switch and Hamming error rates indicate excellent agreement with the Platinum Genomes phasing (Supplementary Fig. 7, left). Since the genotype data we use for PedMEC phasing rely on the Platinum Genomes BAM files, we aimed to provide additional evidence for the quality of the results. To this end, we note that the comparison to the haplotypes obtained from Strand-seq and PacBio (without using family information) also indicates very good agreement (Supplementary Fig. 7, right).
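The limitation of pure genetic haplotyping noted above (a variant that is heterozygous in all three family members cannot be phased from genotypes alone) can be illustrated with a small Python sketch. This is a simplified illustration of Mendelian reasoning for a single biallelic site, not the PedMEC model or the WhatsHap implementation; the function names are hypothetical, and genotypes are assumed to be encoded as the number of alternative alleles (0, 1 or 2).

```python
def transmittable(genotype):
    """Alleles (0 = reference, 1 = alternative) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def phase_child_het(mother_gt, father_gt):
    """Phase a heterozygous child site from the parental genotypes alone.

    Returns (maternal_allele, paternal_allele), or None if the site is
    ambiguous (both parents heterozygous) or Mendelian-inconsistent.
    """
    options = [(m, p)
               for m in transmittable(mother_gt)
               for p in transmittable(father_gt)
               if m != p]  # the child is heterozygous, so the two alleles differ
    return options[0] if len(options) == 1 else None


print(phase_child_het(0, 1))  # (0, 1): mother can only transmit the reference allele
print(phase_child_het(2, 1))  # (1, 0): mother can only transmit the alternative allele
print(phase_child_het(1, 1))  # None: heterozygous in all three individuals
```

Sites that return None here are the ones for which read data from the child (here, PacBio reads) are needed to connect the variant to its neighbours.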