Improved haplotype inference by exploiting long-range linking and allelic imbalance in RNA-seq datasets

Haplotype reconstruction of distant genetic variants remains an unsolved problem due to the short-read length of common sequencing data. Here, we introduce HapTree-X, a probabilistic framework that utilizes latent long-range information to reconstruct unspecified haplotypes in diploid and polyploid organisms. It introduces the observation that differential allele-specific expression can link genetic variants from the same physical chromosome, thus even enabling using reads that cover only individual variants. We demonstrate HapTree-X’s feasibility on in-house sequenced Genome in a Bottle RNA-seq and various whole exome, genome, and 10X Genomics datasets. HapTree-X produces more complete phases (up to 25%), even in clinically important genes, and phases more variants than other methods while maintaining similar or higher accuracy and being up to 10× faster than other tools. The advantage of HapTree-X’s ability to use multiple lines of evidence, as well as to phase polyploid genomes in a single integrative framework, substantially grows as the amount of diverse data increases.


Supplementary Note 1: Theoretical Performance of DASE
We demonstrate in Methods the differential haplotypic expression level of a gene, β, and its coverage determine likelihood of concordant expression. We show this relationship below for varying β and levels of coverage. While these functions are derived from an idealized model of the data, this relationship suggests that as the depth-coverage of a dataset increases, so does the likelihood of concordant expression, and hence the accuracy of HapTree-X. Supplementary Figure 1 displays the theoretical curves depicting the exponential growth of likelihood of concordant expression as a function of coverage and β, the differential haplotypic expression. We infer from this theoretical result that requiring a lower bound of DHE is beneficial for reliable DASE-based phasing given moderate coverage. Furthermore, Supplementary Table 1 shows minimum coverage required to obtain a probability of at least 1 − 10 −α of concordant expression, given β.
As discussed in Methods, the exact solution of maximum likelihood (for any gene g; Equation (9) in the Methods) corresponds to that with concordant expression at all SNP loci within g. HapTree-X therefore uses a threshold λ (negative log-likelihood of concordant expression) which requires any SNP to be concordantly expressed with probability at least 1 − e −λ , in order to be phased. We run HapTree-X while varying this threshold λ; we compute the percentage of concordantly expressed SNPs and the total phased SNPs as we increase this threshold. As the threshold increases, HapTree-X demands any SNP to be phased to have a correspondingly high likelihood of concordant expression; as a result, the phasing accuracy of HapTree-X increases. The cost paid for this increase in accuracy is a decrease in the total number of SNPs phased, as seen in Supplementary Figure 2.
For the results reported, we used a threshold value of 20. In theory, this threshold value λ would produce a percentage of concordantly expressed SNPs equal to 1 − e −λ ; however because of the structural noise commonly observed in aligned RNA-seq data due to false mapping, RNA-editing, as well complex alternative splicing events, we require a λ > λ to meet desired accuracy levels. Additionally, motivated by the above-mentioned discussion we require an estimated β ≥ 0.6 for any gene to be phased using DASE, as it represents a reasonable compromise between the concordance probability and potential number of SNPs that are affected by DASE at that level (for a visualisation of β values please see Figure 1). Finally, we have several methods for managing alternative splicing events.
HapTree-X optionally can avoid all genes with alternative splicing, phase s, s only if the set of isoforms containing s, s are equal, or alternatively phase independent of isoforms but require s, s to have coverage and DASE that are sufficiently similar. This third option was used in the experiments with RNA-seq datasets. The first two options result in higher accuracy for lower λ, but of course phase fewer SNPs.  Table 1: Coverage needed to obtain likelihood 1 − 10 −α of concordant expression given a differential haplotypic expression of β and an assumed opposite allele error rate of 2%.

Supplementary Note 2: Experimental Setup
We evaluate haplotype reconstruction performance of HapTree-X on diploid RNA-seq, whole-genome, wholeexome, and combined sequencing datasets from a human female individual with European ancestry (GM12878), and compare the results to several other sequence-based phasing methods. We repeat this evaluation of HapTree-X on diploid RNA-seq, whole-exome, and combined sequencing datasets on the five Genome in a Bottle (GIAB) samples: NA12878, NA24143, NA24149, NA24385, and NA24631, and extend it to 10X Genomics data for two individuals NA12878 and NA24385. We also ran HapTree-X on Platinum Genomes NA12878 WGS sample.  Pre-processing: All samples were aligned and genotyped by GATK Best Practices pipeline, and the resulting VCFs were fed to each phaser as an input to be phased. In all experiments, we aligned RNA-seq reads directly to the reference genome (hg19) rather than a reconstructed transcriptome based on a gene model, as a significant percentage of RNA-seq reads align better to the outside of the exonic regions rather than inside; forcing these reads to align to the transcriptome will likely cause incorrect alignments and hence add noise to genotyping and phasing algorithms. 10X Genomics datasets were aligned by EMA [1].

Validation:
We assess the accuracy of phased haplotype blocks generated by HapTree-X, by comparing our phasing results to a high-quality GIAB phased VCFs as the ground truth. We restricted DASE-based phasing within HapTree-X only to the SNPs that are located within the same gene in the GENCODE gene annotation v19 (wgEncodeGencodeCompV19). K562 chronic myelogenous leukemia cell line RNA-seq data was validated with the recently-validated phasing ground truth dataset from ENCODE [2].
Tools: We compared our results to those of HapCUT v0.7, HapCUT2, and phASER v0.9. In the case of HapCUT and HapCUT2, to accommodate for long range splicing-junctions within RNA-seq read alignments, we defined maximum insert-size (maxIS) parameter to be longer than each chromosome's length. For HapCUT and phASER, we set minimum base quality (mbq, baseq) parameter to 0 to utilize maximum read contiguity information provided in the datasets (unless otherwise noted). We could not do the same for HapCUT2, so we used the default settings. All other parameters were set to be default.

GIAB RNA-seq sequencing
We repeated our RNA-seq and whole-exome experiments on the five Genome in a Bottle (GIAB) samples: NA12878, NA24143, NA24149, NA24385, and NA24631, and validated our results using high confidence variant calls from GIAB portal as ground truth, which corroborated our HapTree-X performance findings in the 1000 Genomes GM12878 individual. RNA sequencing for each GIAB sample was done in-house. Cells were obtained from Coriell and cultured according to instructions (upright with unfiltered lids overnight then with filtered lids and antibiotics). One million cells from each strain were then flash frozen in lysis buffer (1%BME in RLT). Four replicate RNA extractions were done for each strain using the RNeasy Mini Kit (QIAGEN). RNA-Seq library prep was done with a modified version of the SMART-Seq2 protocol [3,4]. Whole transcriptome amplification was performed using KAPA HiFi PCR Master mix (Kapa Biosystems KK2602) and IS PCR Primer (IDT). Size selection was done using Ampure DNA SPRI beads at 0.6x and 0.7x concentrations. Concentration was quantified using a Qubit. Sequencing libraries were prepared using the Nextera XT DNA tagmentation kit (Illumina FC-131-1096) with 250 pg input for each sample according to the manufacturer's instructions, and Ampure DNA SPRI bead size selection at 0.6x and 0.8x was done. Paired end sequencing was done using Illumina 300 Cycle NextSeq500/550v2 kit with loading density at 2.2 pM.

Phasing in the Presence of Multiple Isoforms
We have seen, with HapTree-X, that it is possible to perform phasing based on differential haplotypic expression (DHE) by making use of differential allele specific expression (DASE). In the case of HapTree-X, we restricted phasing depending on the structure of the isoforms of a gene; phasing in the general case of multiple isoforms is much more complicated. We discuss some cases in which phasing when there are multiple isoforms can be achieved.
Phasing based on DHE and DASE is immediately made more complex by the existence of multiple isoforms. This is primarily because the rates of differential haplotypic expression are independent across isoforms, and furthermore, each isoform is expressed at a different frequency. Fortunately, methods for determining the relative frequencies of expression of isoforms, or quantification, have been studied in [5,6]. We therefore will assume that these relative frequencies are known.
Not only are isoforms expressed at different frequencies, but the two haplotypes of any isoform may be differentially expressed as well; it is this differential expression that we wish to use for haplotype reconstruction. Unfortunately, these DHE rates are not known and we will see that depending on the situation they may or may not be able to be determined. Furthermore, we will see that knowing the differential expressions may not always be enough to determine the haplotypes, and vice versa. We explore this problem here.

Definitions and Notation
Recall the goal of phasing is to recover the unknown haplotypes, H = (H 0 , H 1 ), which contain the sequence of variant alleles inherited from each parent of the individual. From now on, we will restrict ourselves to one gene, G, with isoforms {I 1 , ..., I k }. Let S i denote the set of heterozygous SNPs that are present within the exons of the isoform I i ; we let S = ∪ i S i . For any s ∈ S, we let H 0 [s], H 1 [s] denote the allele present at s on the haplotypes respectively.
Suppose |S| = n and S = {s 1 , ..., s n }, we can define the Isoform Matrix M (an incidence matrix), whose rows correspond to the isoforms I i and columns correspond to the set of SNPs S; that is We assume that the quantification of the isoforms is known; that is each isoform I i is expressed at a rate proportional to α i For each SNP s j , we define the relative SNP expression level as A j , where denote the expressions of the maternal and paternal haplotypes respectively for isoform I i ; let δ i = (β M i − β P i )/2. We assume that M has all distinct rows, that is: isoforms covering identical sets of SNPs are considered to be the same. If this is not the case, say I 1 , ..., I k are all identical with quantifications q 1 , ..., q k and differential expressions δ 1 , ..., δ k , then we may represent these isoforms as one isoform I covering the same set of SNPs with quantification q = q i and differential expression δ = q i δ i . Finally, we let Suppose we are given, for each SNP s j , the proportion of alleles expressed that are the reference allele (0) and alternative allele (1); we denote these as v 0 j and v 1 j , respectively. In our ideal model of the world, all isoforms are known exactly, each isoform is expressed infinitely (in proportion to its quantification), no read covers more than one SNP, and differential expression is exactly constant across an isoform. In this case, the following should hold:

Matrix Representation
It will be useful to formulate these relations in terms of matrices. Let Q = {q i,j } be the k × k diagonal matrix with entries corresponding to the quantifications levels, that is: By definition m i,j = αimi,j Aj . The relations described in 1 and 2 can be represented equivalently as matrix relations as well. To do so, , without loss of generality will restrict ourselves to v 0 . Let S is invertible as σ j = ±1; indeed S = S −1 . We will see that being able to phase with multiple isoforms is equivalent to the existence of unique (up to sign) S such that there exists |δ i | ≤ 1 2 where S, δ satisfies 3.

Problem Statement
We assume an idealized model of the world, where all isoforms of a gene are perfectly known, as are their quantifications. Furthermore, each isoform is expressed infinitely (in proportion to its known quantification), no read covers more than one SNP, and differential expression is exactly constant across an isoform (though unknown). Given total proportions of reference and alternative alleles being expressed from all isoforms at each SNP locus, we can ask for how many possible pairs of haplotypes, given the known isoforms and their quantifications, it is feasible for us to have observed these allele proportions. In the case where there is a unique pair of haplotypes, we say the haplotypes can be recovered and there exists a unique haplotype solution.
In our notation from the previous section, we assume that the differential expressions δ and haplotypes S are fixed but unknown. We are given the isoforms M , their quantifications Q (and from M and Q we may deduce A and therefore M ). Finally we observe v, and equivalently z. The existence of a unique set of haplotypes is equivalent to there existing a unique (up to sign) S such that there exists |δ| ≤ 1 2 where z = δM S.

Zonotope Representation
We are fortunate to be able to borrow from the rich theory of polytopes, zonotopes, and hyperplane arrangements to describe the set of observations z for which a particular haplotype solution S is feasible. We will state several facts about zonotopes without proof; we refer the reader to [7].
A zonotope may be defined as the Minkowski sum of a set of line segments beginning at the origin. Given a set of points X ⊂ V (spanning V ), the zonotope Z(X) ⊂ V may be written as We can shift Z(X), to Z (X), so that it is centered at the origin, by subtracting the point x∈X 1 2 x from all points z ∈ Z(X); equivalently we can write It is known (but non-trivial to show) that Z(X) (or any polytope) may be written as the intersection over a set of half-spaces each defined by a hyperplane. For the case of zonotopes, there is a nice way to define these hyperplanes, described in Proposition 2.43 of [7]. To do so, let F denote the set of subsets of X that span a codimension one subspace of V . For each f ∈ F , let u f be a linear equation such that u f , v = 0 defines the subspace spanned by f . Finally, define µ − f and µ + f to be the minimum and maximum values attained by f over Z(X), respectively. With these definitions in mind, Proposition 2.43 of [7] states that Z(X) may be written as In the case of the shifted zonotope Z (X), u f takes the following minimum and maximum values on Z (X): We will use this half-space representation to describe the observations z for which a particular haplotype solution S if feasible. For any fixed S, the set of z satisfying 3 are those points in the shifted zonotope Z (X), where X is the set of rows of the matrix M S; we refer to this zonotope as Z S (M ). Recall S is the diagonal matrix, with entries σ j = ±1, representing the haplotypes. Without loss of generality, we consider only Z (M ) = Z I (M ), where I is the identity matrix. We may restrict to this case because any Z S (M ) is the reflection of Z (M ) through the set of coordinates j such that σ j = −1. For fixed n, the most general isoform matrix (which we will denote asM (n)) is the collection of all nonempty binary strings. Determining the linear equations defining all codimension one subspaces spanned by the row vectors ofM (n) is sufficient for understanding the hyperplane description of any Z (M ), provided M has at most n columns.
In the following section, we define the hyperplane descriptions of Z (M ) and their corresponding minima and maxima for the case when n = 2. The cases of n = 3 and n = 4 are included at the end of this section.

Phasing Two SNPs Given Multiple Isoforms
In the case where a gene G covers exactly two SNPs, we may write down simple necessary and sufficient conditions for the existence of a unique haplotype solution S given z, Q, and M . Recall that we may assume that M has no duplicate rows (that is, isoforms covering identical sets of SNPs are considered to be the same.) Therefore, without loss of generality, we can write: We investigate what kind of solution pairs δ, S satisfy zS = δM with −1 2 ≤ δ i ≤ 1 2 and S = σ1 0 0 σ2 , with σ j ∈ {±1}. The case S = ± ( 1 0 0 1 ) corresponds to the case in which the two SNPs are phased in parallel (the reference allele occurs at both SNP sites on one haplotype and the alternative allele at both SNP sites on the other.) The case S = ± −1 0 0 1 corresponds to the SNPs having switched phase (one haplotype has a reference allele followed by alternative allele and the other haplotype an alternative allele followed by a reference allele.) Note that if S, δ is a solution, then −S, −δ is also a solution and corresponds to relabelling the haplotypes (switching the maternal and paternal haplotypes), and we therefore restrict ourselves to the cases say S p = ( 1 0 0 1 ) and S s = −1 0 0 1 . It is possible to determine the phase of the gene G if there do not exist both forms of solution pairs S p , δ and S s , δ; having solutions of both forms imply z can be explained by either parallel or switched phasing, and therefore the phasing is ambiguous. We derive conditions on z which determine when z is such that it is feasible for G to be parallel phased; equivalently when there exist solution pairs S p , δ satisfying zS p = δM and − 1 2 ≤ δ i ≤ 1 2 . We also derive the corresponding conditions for the case of switched phase. The phasing of G is ambiguous when z satisfies both sets of conditions. From the viewpoint of section Zonotope Representation, we wish to describe Z Sp (M ), its reflection across the y−axis, Z Ss (M ), and their intersection; any point z in the intersection Z Sp (M )∩Z Ss (M ) implies the phasing of G is ambiguous. In the discussion that follows, we think of Z (M ) as the projection of the cube ∆ defined by − 1 2 ≤ δ i ≤ 1 2 by multiplication on the right by M (Supplementary Figure 3). To determine conditions for when z can be explained by parallel phasing, we derive equations for the hyperplane description of the zonotope Z Sp (M ) in terms of z and Q. Proposition 1. The following conditions are necessary and sufficient for the existence of δ ∈ ∆ such that (x, y) = z = δM , where ∆ is the cube defined by |δ i | ≤ 1 2 : Proof. From the discussion in section Zonotope Represenation, we begin by enumerating the linearly independent subsets of M who span a codimension 1 subspace, trivially these are the rows r 1 , r 2 , r 3 of M .
Equating (x, y) to δM implies the following equations: In general, the coefficient attached to δ i on the RHS is what the row r i evaluates to under the LHS. Each δ i is missing from one these equations, and these equations are therefore sufficient for defining the zonotope as an intersection of half-spaces. To determine those half-spaces, we follow Zonotope Represenation and minimize and maximize the RHS of each equation, implying the bounds in the statement of Proposition 1.
Proposition 2. The following conditions are necessary and sufficient for the existence of δ ∈ ∆ such that (x, y) = z = δM −1 0 0 1 , where ∆ is the cube defined by |δ i | ≤ 1 2 : This statement follows from Proposition 1 by a reflecting across the y-axis. By combining Propositions 1 and 2 we can determine when the phasing of G is ambiguous; that is, when both C − xy and C + xy hold. We include Supplementary Figure 4 depicting the zonotopes Z Sp (M ) and Z Ss (M ) as determined by 1 and 2. As intuition suggests, the larger α 3 is to relative to α 1 , α 2 , the smaller the set of z with ambiguous phasing (Z Sp (M ) ∩ Z Ss (M )). This intuition can be explained by the idea that I 3 is the only isoform with information about the true phase.

Modeling Noise
The model described in the preceding section assumed a theoretical set up where coverages of isoforms are sent to infinity in proportion to their quantifications, and the differential expression across an isoform is exactly constant. In reality, we do not have infinite coverage. Furthermore, we can have variation of coverage relative to the 'known' quantifications (of an entire isoform or just a SNP) and variation of differential expression within an isoform. These variations can be caused by both random and structural noise. Modeling all of these variations at once is equivalent to finding solutions δ, S to the equations , such that |δ i | ≤ 1 2 and σ j = ±1, the α , δ are bounded in some way, and possibly |δ i + δ,i,j | ≤ 1 2 as well. Even for fixed S, this problem is non-linear since the errors and δ are unknown. Instead, we discuss the simpler case of modeling variation of differential expression across an isoform.
Assuming fixed and known quantifications, we can model variation at the SNP level of differential expression. Without noise, differential expression ought to be the same for all SNPs across an isoform. We introduce noise limits ε = {ε i,j } to account for this variation. Given an observation v 0 , v 1 , we say S is a feasible haplotype solution if there exists |δ| ≤ 1 2 and ≤ ε (where { δ,i,j }) such that the following hold: Recall f k = ( 1 2 , 1 2 , ..., 1 2 ) and is of length k and that the column sums of M are one. Furthermore, we may wish to bound |δ i + δ,i,j | ≤ 1 2 since those limits correspond to only one haplotype being expressed.
If we impose |δ i + δ,i,j | ≤ 1 2 , then |z| ≤ 1 2 as well. For fixed S and given z, if we insist that |δ i + δ,i,j | ≤ 1 2 , then the existence of |δ| ≤ 1 2 satisfying 12 does not imply the existence of |δ| ≤ 1 2 , ≤ ε satisfying 11. If |z| ≤ 1 2 as it will be in practice, then there will exist a solution where the weighted average over i of |δ i + δ,i,j | is less than a half for each j.

Experimental Results
We wish to get a sense for whether or not phasing when multiple isoforms are present is possible with the current state of RNA-seq data. To do so, we see how accurate both our idealized and error based models are when implemented on RNA-seq data. We use RNA-seq raw read datasets of GM12878 obtained from ENCODE CSHL Long RNA-seq (wgEncodeCshlLongRnaSeq) track with average sequencing depth of 100 million mate-pairs (2x76bp), and transcriptome fragments sequenced from three different subcellular compartments: whole cell, cytosol, and nucleus. Also obtained from ENCODE CSHL Long RNA-seq are the transcript quantifications based on GENCODE gene annotation v7. We take as ground truth trio-phased SNPs of NA12878 from 1000 Genomes Project.
Based on the gene model mentioned above, we find all isoform covered SNP pairs (s1, s2) (that is, those which occur together in an isoform with non-zero quantification; from here on, when we say isoform, we mean an isoform with non-zero quantification.) We can ask what the coverage at (s1, s2) alone says about the feasible phasing solutions of the pair. This problem can be reduced to the setup in section 3.1.5 by letting α 1 , α 2 and α 3 equal the sum of quantifications of all isoforms covering only s1, only s2, and both s1 and s2 respectively. Suppose the pair has coverage [A 1 , B 1 ] and [A 2 , B 2 ] where A i , B i denote the number of reads containing the reference allele at s i and the alternative allele at s i respectively. In our previous notation, we can write (x, y) = ( A1 A1+B1 − 1 2 , A2 A2+B2 − 1 2 ) and check when either C − x,y (1) holds, which would mean parallel phasing is feasible, and when C + x,y (2) holds, implying switched phasing is feasible.
In this context, it is possible to have two, one, or zero feasible solutions. In order for phasing with multiple isoforms to be effective, we ideally would like there to exist a unique feasible solution a high fraction of the time, and furthermore for that solution to be the true phasing of the pair. In the table below we count the number of pairs with two, one, and zero feasible solutions. We also report how often the unique feasible solution is accurate (AUFS), by which we mean agrees with the phasing provided in the gold-standard platinum VCF. We also report the same stats in the case where we condition on the coverage being at least 15 for both SNPs; somewhat surprisingly this does not affect AUFS significantly.
To incorporate error into the model, we follow the formulation in 12 and ask, for fixed ε, when there exists a feasible solution to |z − δM S| ≤ ε.
Equivalently, for which z does there exist z with |z − z| ≤ ε satisfying: Let z = (x , y ), it follows from 1 that the set of z satisfying 13 are those which satisfy: for S = ± ( 1 0 0 1 ) and ± −1 0 0 1 respectively. Note that z will always satisfy C x and C y by definition. For any z = (x , y ), we can ask how much error is required in order for some S to be feasible; we let f p (x , y ) and f s (x , y ) denote such an error bound for parallel phased and switched phase solutions respectively: We can ask at various error levels, how many isoform covered SNP pairs have two, one, and zero feasible solutions (Supplementary Figure 5a). For the SNP pairs admitting only one feasible solution at a given allowed error bound, we compute the proportion of accurately phased SNP pairs (AUFS). As these curves do not vary significantly across Poly-A profile or replicate, we report these results for just Poly-A-replicate 2.
Generalizing one step further, we consider phasing a pair of SNPs when one phasing solution requires m ≤ ε m error in order to be feasible and the other requires M ≥ ε M error in order to be feasible. Using the notation introduced 14 and 15, we phase a SNP pair when either f p (x , y ) ≤ ε m and f s (x , y ) ≥ ε M or f s (x , y ) ≤ ε m and f p (x , y ) ≥ ε M . When these conditions are satisfied, we compare the solution with lower error to the true phase. We refer to the upper bound ε m as allowable min-error and the lower bound ε M as required max-error. By doing so, we attempt to capture the intuition that given one solution which appears to be very good, the worse the alternate solution is, the higher our confidence the good solution is the true solution.
Below we plot how many SNP pairs can be phased for various ε m , ε M (Supplementary Figure 7a) and how often those phasing solutions are accurate (Supplementary Figure 7b). Below, each curve corresponds to some ε m ∈ [0, .02, .05, .1]. The enlarged points on the curves correspond to ε m = ε M ; those points occur in Supplementary Figure 5b. In Supplementary Figure 6, for ε m = .05, we plot the proportion of SNPs phased and phasing accuracy curves jointly. In all cases, we require coverage of at least 15 for all SNPs. As intuition suggests, accuracy appears to be approximately monotonically increasingly, and percentage of SNPs phased approximately monotonically decreasing, differing from the results presented in Supplementary Figure 5b.
Intuition suggests that with higher coverage, accuracy ought to increase; unfortunately requiring higher coverage limits how often we are able to phase. We fix ε m = .05 and ε M = .15 and for varying required coverage, we report what proportion of SNP pairs have the required amount of coverage , how accurately we phase when the coverage and error thresholds are satisfied (Supplementary Figure 8), and of SNP pairs satisfying the coverage thresholds, what proportion satisfy the error thresholds and thus we choose to phase (Supplementary Figure 9).
We find that attempting to phase SNP pairs in the presence of multiple isoforms without taking into account error leads to low accuracy in phasing results, and furthermore for all replicates and both Poly-A profiles, between 55-60% of SNP pairs have no feasible solutions (Supplementary Table 3 and Supplementary  Table 4).
If we incorporate allowable error, the percentage of SNP pairs with a unique feasible solution is maximized around an allowable error ε = 0.05. As we increase the allowable error, the accuracy rate of phasing increases, but the percentage of SNP pairs with two feasible solutions with goes to 100% (Supplementary Figure 5a), as it must, and therefore the percentage with a unique feasible solution goes to zero (Supplementary Figure 5b).
When we allow two dimensional error thresholds, that is, when in order to phase we require one solution be very close to 'perfect' (low required error) and the other to be very far from it (high required error), we see (unsurprisingly) for fixed allowable min-error, increasing required max-error increases accuracy while decreasing the proportion of SNPs phased (Supplementary Figure 6 and Supplementary Figure 7). Additionally, we see for fixed ε m and ε M , increasing coverage increases accuracy. In this context, for a fixed pair of error thresholds, the total number of SNP pairs able to be phased decreases as required coverage increases, though the proportion of SNP pairs (out of SNP pairs satisfying the lower bound of required coverage) which satisfy the error thresholds is noisy, but relatively flat when varying coverage.
At this time, it seems that certain pairs of SNPs can be accurately phased in the presence of multiple isoforms: those which satisfy various error and coverage thresholds. It may be useful to incorporate this sort of model into the general HapTree-X framework. To phase an entire gene of (possibly more than two) SNPs simultaneously, we can employ linear programming for each potential haplotype solution S for the gene to determine whether there exists z = δM S with |δ| ≤ 1 2 , or more generally if there exists a solution based on the error model for non-uniform differential expression in subsubsection 3.1.6. This approach unfortunately involves enumerating 2 n−1 haplotypes, where n is the number of SNPs in the proposed gene, though is certainly feasible for some small n > 2 (and likely most genes).

Results
PolyA  Ne g a tive Lo g -Like lih o o d of Co n co rd a n t Ex p re s s io n Co n co rd a n t Ex p re s s io n Ra t e