Quantification of differential gene expression by multiplexed targeted resequencing of cDNA

Whole-transcriptome or RNA sequencing (RNA-Seq) is a powerful and versatile tool for functional analysis of different types of RNA molecules, but sample reagent and sequencing cost can be prohibitive for hypothesis-driven studies where the aim is to quantify differential expression of a limited number of genes. Here we present an approach for quantification of differential mRNA expression by targeted resequencing of complementary DNA using single-molecule molecular inversion probes (cDNA-smMIPs) that enable highly multiplexed resequencing of cDNA target regions of ∼100 nucleotides and counting of individual molecules. We show that accurate estimates of differential expression can be obtained from molecule counts for hundreds of smMIPs per reaction and that smMIPs are also suitable for quantification of relative gene expression and allele-specific expression. Compared with low-coverage RNA-Seq and a hybridization-based targeted RNA-Seq method, cDNA-smMIPs are a cost-effective high-throughput tool for hypothesis-driven expression analysis in large numbers of genes (10 to 500) and samples (hundreds to thousands).

Comparison of reproducibility of differential expression estimates of cDNA-smMIPs and RNA-Seq at the same depth of sequencing. Nonoverlapping subsets of respectively 20,000 read pairs (panel a, b)   and HEK293 were homozygous for the opposite allele.
A) The relative allelic ratio of subsequent dilution steps, across all dilution steps. For a given smMIP and dilution step d, the allelic ratio is defined as the fraction AR = NK562/(NK562+NHEK293). The relative allelic ratio of dilution step d is then given by AR(d) / AR(d-1). This value is always expected to be 0.75, regardless of the initial expression ratios of the target transcript in K562 and HEK293 cells.
B) The relative allelic ratio stratified by dilution step. The horizontal axis represents the expected allelic ratio relative to the first dilution step (i.e., AR(d)/AR(1)), and illustrates that the absolute difference in the allelic ratio of subsequent dilution steps (AR(d)/AR(d-1), vertical axis) also becomes exponentially smaller. Given that the ratio is estimated from molecule counts, the variation in the allelic ratio relative to the previous dilution step becomes increasingly larger as the number of dilutions increases. The red line indicated the expected allelic ratio relative to the previous dilution step.

Bayesian hierarchical model to estimate differential expression
We constructed a statistical model to integrate observations from replicates into a single estimate of expression, and to quantify uncertainty in estimates of differential expression. We used a negative binomial distribution to model the unique molecule counts. We defined expression for a given probe as the logarithm of the mean of the negative binomial distribution; this expression is corrected for probe bias and normalized for sequencing depth. We assume that the overdispersion factor (relation between mean and variance) is the same for all probes. We allow for heterogeneity in the dispersion factor between experiments, as our results indicate that some experiments show more variability than others. The probe bias is estimated from differences in normalized counts (molecules per million molecules) between replicates and then used as a covariate in the model. We used Stan 5 to perform inference in this model using Markov chain Monte Carlo sampling. Stan was run independently for each condition (a condition is defined as a set of replicate experiments) to generate 1000 independent samples. We used these samples from the posterior distribution to estimate differential expression between conditions.

Mathematical formulation
To estimate differential expression for two conditions, we correct each pair of samples from the Markov chain for the probe bias using ordinary least squares regression.
We define ! !" ! as the number of molecules counted for probe ! in condition ! and replicate !. Thus, the pair of subscripts (!, !) thus uniquely defines an experiment, where ! = 1, … , ! ! , with ! ! the number of replicates for condition !, and ! = 1, … , !, with ! the number of conditions, and ! = 1, … , !, with ! the number of smMIP probes. We define ! ! ! as the expression value for probe ! in condition !, corrected for probe bias ! ! and normalized for differences in sequencing depth through library size factors ! !" . The counts for different replicates in the same condition directly depend on this shared expression value. This is the mechanism by which observations from multiple replicates are combined to obtain accurate estimates.
The joint probability of the variables in the model conditional on fixed parameters is given by: The probe biases form a matrix ! ! ! where ! indexes the !th bias vector estimated from the data as described below. ! !" ! is coefficient for the !th bias-vector on experiment (!, !) and is a random variable in the model, where we arbitrarily choose We use the negative binomial distribution to model the smMIP molecule counts: The expected value ! !" ! for the negative binomial depends on the normalized expression value ! ! ! , the library size factors and bias influences ! as follows: Thus, the influence (regression coefficient) of the probe bias ! ! ! is modelled by a random continuous variable in the model. While we can infer influence of this bias from differences between replicate experiments, there is no information to decide which experiment actually was closest to the true expression value. Therefore we arbitrarily set the influence of the probe bias to zero for the first replicate in a given condition. However, this introduces an arbitrary offset to the normalized expression values ! ! ! . As a consequence, when we estimate differential expression between two conditions, we again correct for the probe bias.
The mean-variance relationship for the negative binomial is given by where ! !! is a hyperparameter with a ! prior distribution: As a result, the overdispersion relation is the same for all probes but may vary between experiments. This is in part necessary because of the shared mean between replicate experiments, which may increase variance in one replicate, and in part due to actual experimental variability between. Here our model differs from that of DESeq 6 , which allows the overdispersion factor to vary between genes.
The library size factors ! !" are estimated from the molecule counts following the DESeq approach 6-8 : where ! ! represents the total number of experiments ! ! = ! ! ! . As expression values ! ! ! are to be compared between conditions they must be on a common scale.
Therefore the size factors are estimated from all experiments jointly.
As we perform MCMC inference we do not optimize the nuisance parameters ! and ! but integrate over the possible values consistent with the observations ! and fixed parameters ! and !.

Estimation of differential expression
The joint distribution fully factorizes into conditions !, as only constant parameters are shared between conditions. We therefore perform MCMC inference independently for each condition. This generates samples ! = 1, … , ! with ! the number of MCMC samples (we used 1000 independent samples). A naive estimate of differential expression between conditions ! = 1 and ! = 2 would consist of taking the average difference across samples: However, because we have set an arbitrary offset to the expression levels by setting the probe bias influence to zero for the replicate, ! !,!!! = 0, we need to again correct differential expression estimates for the possible presence of probe bias. For computational efficiency we obtain an estimate of differential expression Δ !,! for each MCMC sample ! by correcting for the probe bias vectors ! ! using ordinary least squares regression (without overall mean effect) with different probes ! as observations: where ! ! ! ∼ !(0, ! ! ) is the unexplained error, and !ʹ ! ! are the regression coefficients for each bias vector ! and sample !. Thus, the !ʹ ! can be estimated from the ! = 1, … , ! probe-level expression estimates. The differential expression estimate for probe ! is then given by the corrected differential expression averaged over samples: quantify uncertainty in the estimates of differential expression.
Although technically it would be preferable to jointly model all conditions in a single model, we found that in practice this resulted in slow convergence and poor performance.

Estimation of probe bias
We used an empirical Bayes approach to estimate the values for the probe bias matrix ! ! ! . The idea is to compare the difference in log-normalized counts between one pair of replicates with the difference in log-normalized counts between another independent pair of replicates. When the correlation across probes of these differences is very high, we assume we have identified a systemic probe bias.
First, we normalize the molecule counts by the total number of molecules counted for an experiment (!, !): between all pairs of replicates ! ! , ! ! belonging to the same condition !: Thus, we never compare replicates from different conditions. The number of vectors

cDNA-smMIP experimental procedures
The cDNA smMIP experimental procedure described below is largely based on the smMIP protocol developed for genomic DNA 4,9 . All reagents used are specified in Supplementary Table 19; all equipment used is specified in Supplementary Table   20.

1) MIP pooling
We used three independent cDNA-smMIP pools for experiments testing differential expression of ERCCs (337 smMIPs), differential expression of EBVs and PBMCs (95 smMIPs), and allele specific expression (64 smMIPs). The detailed protocol below describes the experiment we used to measure differential expression with EBVs. In these experiments 5 µl of each of the 95 smMIPs were pooled into one single tube.

2) MIP Phosphorylation
An aliquot representing 0.5 µl per smMIP of the 1x pool was used for phosphorylation. Volumes of other components were calculated according the example in the Supplementary Table 8.
If necessary split the volume into more PCR tubes, not exceeding 100 µl/tube for optimal thermal conditions; Run the PCR program shown in Supplementary Table 9 on a PCR machine:

3) MIP Capture
The previously published protocol for genomic DNA 4 describes a MIP to gDNA molecule ratio of 800:1, corresponding to 264,000 MIP molecules per ng gDNA. For cDNA smMIPs we used 10-fold more smMIPs, to compensate for the higher amount of RNA than DNA molecules per cell, i.e. 2,640,000 smMIPs per ng cDNA.
1. Pipette 10 ng of cDNA in 10 µl (H2O) in plate or strip tube; add a blanc with 10 µl H2O. The cDNA input amount may differ per celltype, for several samples we used 3 different input amounts, e.g. 1 ng, 10 ng, 50 ng.
2. Make a MIP capture mix for at least 30 reactions (example in Supplementary