A scalable unified framework of total and allele-specific counts for cis-QTL, fine-mapping, and prediction

Liang, Yanyu; Aguet, François; Barbeira, Alvaro N.; Ardlie, Kristin; Im, Hae Kyung

doi:10.1038/s41467-021-21592-8

Article
Open access
Published: 03 March 2021

Nature Communications volume 12, Article number: 1424 (2021)

Abstract

Genetic studies of the transcriptome help bridge the gap between genetic variation and phenotypes. To maximize the potential of such studies, efficient methods to identify expression quantitative trait loci (eQTLs) and perform fine-mapping and genetic prediction of gene expression traits are needed. Current methods that leverage both total read counts and allele-specific expression to identify eQTLs are generally computationally intractable for large transcriptomic studies. Here, we describe a unified framework that addresses these needs and is scalable to thousands of samples. Using simulations and data from GTEx, we demonstrate its calibration and performance. For example, mixQTL shows a power gain equivalent to a 29% increase in sample size for genes with sufficient allele-specific read coverage. To showcase the potential of mixQTL, we apply it to 49 GTEx tissues and find 20% additional eQTLs (FDR < 0.05, per tissue) that are significantly more enriched among trait-associated variants and candidate cis-regulatory elements compared to the standard approach.

Introduction

Genome-wide association studies (GWAS) have identified tens of thousands of genomic loci associated with complex traits, but most of these loci lie in noncoding regions of the genome, indicating transcriptome regulation as a potential key driver of disease biology. Multiple methods have been developed to integrate GWAS results with expression quantitative trait loci (eQTLs) and inform mechanisms underlying GWAS loci. Two strategies are commonly employed: (1) association-based approaches including PrediXcan¹, fusion², and smr³; and (2) colocalization-based approaches including coloc⁴, eCAVIAR⁵, and enloc⁶. Association-based approaches correlate genetic predictors of gene expression with complex traits of interest. Colocalization-based approaches rely on high-quality eQTL mapping and fine-mapping results to identify potentially causal genes.

In addition to gene expression levels measured by total read counts, allele-specific expression (the relative expression difference between the two haplotypes) provides valuable additional information that can be leveraged to improve eQTL mapping and fine-mapping. Several methods have been proposed to combine total and allele-specific read counts for QTL mapping, such as TReCASE⁷, WASP⁸, and RASQUAL⁹. However, running these methods on sample sizes beyond a few hundred is generally computationally intractable, and as a result they have not been applied to large-scale studies like GTEx, which includes over 15,000 samples across 49 tissues. For fine-mapping, two approaches that combine allele-specific expression and eQTL mapping via meta-analysis have been recently proposed^10,11. However, to our knowledge, no existing method provides a scalable unified framework combining total and allele-specific counts with explicit multi-SNP modeling for QTL mapping, fine-mapping, and prediction.

By assuming a log-linear model for transcript expression levels with independent reads from each haplotype and weak genetic effects, as proposed in ref. ¹², we derive two approximately independent equations for allelic imbalance (read count ratio between the two haplotypes) and total read count. In this work, we develop a unified framework and computationally efficient algorithms combining total and allele-specific reads for QTL mapping, fine-mapping, and prediction. We demonstrate the resulting gain in performance with simulations under a range of different settings, applications to GTEx v8 data¹³, and comparisons to a large-scale eQTL meta-analysis from eQTLGen¹⁴. We also generated mixQTL results for the full set of GTEx data and make this resource publicly available. The software, simulation, data preprocessing, and analysis pipelines can be found at https://github.com/hakyimlab/mixqtl¹⁵, https://github.com/liangyy/mixqtl-pipeline¹⁶, and https://github.com/liangyy/mixqtl-gtex¹⁷. A computationally efficient GPU-based implementation of mixQTL has been embedded in tensorQTL https://github.com/broadinstitute/tensorqtl¹⁸.

Results

Overview of the statistical model

To develop a computationally efficient approach that integrates total and allele-specific count data, we assumed multiplicative cis-regulatory effects and noise, similarly to the model proposed in ref. ¹². For a given gene, we modeled the haplotypic read count ${Y}_{i}^{h}$, which is the number of reads from haplotype h of individual i as

$${Y}_{i}^{h}={L}_{i}\cdot {\theta }_{0,i}\cdot \exp (\beta \cdot {X}_{i}^{h})\cdot \exp ({\epsilon }_{i}^{h}),$$

(1)

where L_i is the library size for individual i, θ_0,i is the baseline abundance (for a haplotype with the reference allele), $\exp (\beta )$ is the cis-regulatory effect (allelic fold change due to the presence of the alternative allele), ${X}_{i}^{h}$ indicates the dosage of the variant (0 if the individual has the reference allele, and 1 if they have the alternative one), and $\exp ({\epsilon }_{i}^{h})$ is the multiplicative noise.
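As a minimal sketch of Eq. (1), the snippet below evaluates the haplotypic count for a reference and an alternative haplotype in the noise-free case, which makes explicit that $\exp (\beta )$ acts as the fold change between the two. The function name `haplotypic_count` and all numeric values are illustrative assumptions and are not taken from the mixQTL software.

```python
import math

def haplotypic_count(library_size, theta0, beta, dosage, eps=0.0):
    """Eq. (1): Y = L * theta0 * exp(beta * X) * exp(eps)."""
    return library_size * theta0 * math.exp(beta * dosage) * math.exp(eps)

# Hypothetical values chosen only for illustration.
library_size = 1e6      # L_i
theta0 = 1e-4           # baseline abundance theta_{0,i}
beta = math.log(1.5)    # log allelic fold change

# Noise-free counts for a reference (X = 0) and an alternative (X = 1) haplotype.
y_ref = haplotypic_count(library_size, theta0, beta, dosage=0)
y_alt = haplotypic_count(library_size, theta0, beta, dosage=1)

# The ratio of the two counts recovers exp(beta), the allelic fold change.
fold_change = y_alt / y_ref
```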

Calculating the total read count as the sum of the two haplotypic counts and assuming weak cis-regulatory effects, we derived an approximately linear model for the logarithm of the haplotypic and total read counts (see details in Methods and Supplementary Notes 1). In practice, we only observe the allele-specific reads that include a heterozygous site denoted as ${Y}_{i}^{(h)\,\text{obs}\,}={\alpha }_{i}\cdot {Y}_{i}^{h}$, which is a fraction of the total haplotypic count. To take this partial readout into account, we modeled the observed total and allele-specific counts as

$$\begin{aligned}{\mathrm{log}}\,{Y}_{i}^{(1)\,\text{obs}\,} &= {\mathrm{log}}\,{L}_{i}+{\mathrm{log}}\,{\alpha }_{i}+{\mathrm{log}}\,{\theta }_{0,i}+{X}_{i}^{1}\beta +{\epsilon }_{i}^{(1)}\\ {\mathrm{log}}\,{Y}_{i}^{(2)\,\text{obs}\,} &= {\mathrm{log}}\,{L}_{i}+{\mathrm{log}}\,{\alpha }_{i}+{\mathrm{log}}\,{\theta }_{0,i}+{X}_{i}^{2}\beta +{\epsilon }_{i}^{(2)}\\ {\mathrm{log}}\,\frac{{Y}_{i}^{\,\text{total}\,}}{2} &\approx {\mathrm{log}}\,{L}_{i}+{\mathrm{log}}\,{\theta }_{0,i}+\frac{{X}_{i}^{1}+{X}_{i}^{2}}{2}\beta +{\epsilon }_{i}^{\,\text{trc}\,}\end{aligned}$$

(2)

where the error terms are ${\epsilon }_{i}^{\,\text{trc}\,} \sim N(0,\frac{{\sigma }^{2}}{{Y}_{i}^{\,\text{total}\,}})$, ${\epsilon }_{i}^{(h)} \sim N(0,\frac{{\sigma }^{2}}{{Y}_{i}^{(h)\,\text{obs}\,}})$ and the errors of the two haplotypes are independent: ϵ⁽¹⁾⊥⊥ϵ⁽²⁾. Here, we let the ϵ terms have variance inversely proportional to the actual count; by doing so, we ensure that the variance of the count scales approximately linearly with the mean of the count, as demonstrated in Supplementary Notes 1.2.

We further simplified the models by combining the two allele-specific counts and defining the baseline abundance variation as a random effect z_i ($\mathrm{log}\,{\theta }_{0,i}$ = population mean + z_i). We then merged the total count error term ${\epsilon }_{i}^{\,\text{trc}\,}$ and z_i into one term ${\widetilde{z}}_{i}$, since ${\epsilon }_{i}^{\,\text{asc}\,}$ is approximately independent of both of them (see Methods and Supplementary Notes 4.1). The final model is

$${\mathrm{log}}\,\frac{{Y}_{i}^{(1)\,{\text{obs}}\,}}{{Y}_{i}^{(2)\,{\text{obs}}\,}}\ =\ ({X}_{i}^{1}-{X}_{i}^{2})\beta \ +{\epsilon }_{i}^{\,\text{asc}\,}\ \,{({\rm{allelic}}\ {\rm{imbalance}}\ {\rm{eq.}})}\,$$

(3)

$${\mathrm{log}}\,\frac{{Y}_{i}^{\,\text{total}\,}}{2{L}_{i}}\ \approx {\mu }_{0}+\ \frac{{X}_{i}^{1}+{X}_{i}^{2}}{2}\beta \ +{\widetilde{z}}_{i}\ \,{({\rm{total}}\ {\rm{read}}\ {\rm{count}}\ {\rm{eq.}})}\,$$

(4)

where ${\widetilde{z}}_{i} \sim N(0,{\widetilde{\sigma }}_{0}^{2})$ and ${\epsilon }_{i}^{\,\text{asc}\,} \sim N(0,{\sigma }^{2}\cdot (\frac{1}{{Y}_{i}^{(1)\,\text{obs}\,}}+\frac{1}{{Y}_{i}^{(2)\,\text{obs}\,}}))$ and ${\widetilde{z}}_{i}$ is approximately independent from ϵ^asc.
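The quantities entering Eqs. (3) and (4) can be computed directly from the observed counts. The sketch below does this for a single individual and a single SNP; the function name `linearized_responses` and the example counts are hypothetical, and the code assumes strictly positive counts (how zero counts are handled in practice is not covered here).

```python
import math

def linearized_responses(y_total, y1_obs, y2_obs, library_size, x1, x2):
    """Responses and predictors of Eqs. (3) and (4) for one individual.

    Assumes all counts are strictly positive so that the logarithms exist.
    """
    return {
        # Total read count equation (Eq. 4).
        "y_trc": math.log(y_total / (2.0 * library_size)),
        "x_trc": (x1 + x2) / 2.0,
        # Allelic imbalance equation (Eq. 3).
        "y_asc": math.log(y1_obs / y2_obs),
        "x_asc": x1 - x2,
        # Var(eps_asc) = sigma^2 * (1/Y1 + 1/Y2) = sigma^2 / w_asc.
        "w_asc": 1.0 / (1.0 / y1_obs + 1.0 / y2_obs),
    }

# Hypothetical heterozygous individual (alternative allele on haplotype 1).
r = linearized_responses(y_total=2000, y1_obs=150, y2_obs=100,
                         library_size=1e6, x1=1, x2=0)
```

The weight `w_asc` is small whenever either allele-specific count is small, so individuals with few allele-specific reads contribute little to the allelic imbalance regression.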

This single-SNP model extends to multiple SNPs in a straightforward manner by using a vector of allelic dosages (X_i1, ⋯ , X_ip) and genetic effects (β₁, ⋯ , β_p) instead of the scalar values above. Here, p represents the number of genetic variants in the cis-window of the gene under consideration (Supplementary Notes 3 and 5).

For cis-QTL mapping, we took advantage of the approximate independence of the allelic imbalance and the total read counts in Eqs. (3) and (4), solving them as separate linear regressions (for computational efficiency) and combining the results via inverse-variance weighted meta-analysis. We call this method mixQTL.
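A simplified sketch of this procedure for one SNP is given below: an ordinary least-squares fit with intercept for the total read count equation, a weighted fit through the origin for the allelic imbalance equation, and an inverse-variance weighted combination. The function names, the degrees-of-freedom choices, and the toy data are assumptions made for illustration; the actual estimators are specified in Supplementary Notes 4.2 and may differ in detail.

```python
import math

def ols_slope(x, y):
    """Slope and standard error from simple linear regression with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    rss = sum(((yi - mean_y) - beta * (xi - mean_x)) ** 2 for xi, yi in zip(x, y))
    return beta, math.sqrt(rss / (n - 2) / sxx)

def wls_slope_no_intercept(x, y, w):
    """Slope and standard error from weighted regression through the origin."""
    n = len(x)
    swxx = sum(wi * xi * xi for xi, wi in zip(x, w))
    beta = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w)) / swxx
    wrss = sum(wi * (yi - beta * xi) ** 2 for xi, yi, wi in zip(x, y, w))
    return beta, math.sqrt(wrss / (n - 1) / swxx)

def inverse_variance_meta(beta_a, se_a, beta_b, se_b):
    """Fixed-effect inverse-variance weighted combination of two estimates."""
    wa, wb = 1.0 / se_a ** 2, 1.0 / se_b ** 2
    beta = (wa * beta_a + wb * beta_b) / (wa + wb)
    se = math.sqrt(1.0 / (wa + wb))
    return beta, se, beta / se

# Toy data for six individuals with true beta = 0.4. The noise values are
# hand-picked to be orthogonal to the genotype, so both fits return 0.4.
x_trc = [0.0, 0.5, 1.0, 0.5, 0.0, 1.0]      # (X1 + X2) / 2
x_asc = [0.0, 1.0, 0.0, -1.0, 0.0, 0.0]     # X1 - X2
noise_trc = [0.02, -0.01, 0.01, 0.01, -0.02, -0.01]
noise_asc = [0.05, -0.03, 0.02, -0.03, -0.04, 0.01]
y_trc = [-7.0 + 0.4 * x + e for x, e in zip(x_trc, noise_trc)]
y_asc = [0.4 * x + e for x, e in zip(x_asc, noise_asc)]
w_asc = [50.0, 60.0, 40.0, 60.0, 55.0, 45.0]

beta_trc, se_trc = ols_slope(x_trc, y_trc)
beta_asc, se_asc = wls_slope_no_intercept(x_asc, y_asc, w_asc)
beta_mix, se_mix, z_mix = inverse_variance_meta(beta_trc, se_trc, beta_asc, se_asc)
```

Because every step has a closed form, the combined standard error is never larger than either input standard error, which is where the power gain of the combined test comes from.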

For the fine-mapping and prediction problems, we also leveraged the approximate independence of the allelic imbalance and total read count equations. We used a two-step approach in which we first scale the two equations so that they become independent data points with equal variances. In the second step, we combined these data points into an augmented dataset and applied the existing algorithms SuSiE¹⁹ and elastic net²⁰. We term these methods mixFine and mixPred, for fine-mapping and prediction, respectively.

Simulation of total and allele-specific reads

To assess the benefits of this unified framework over using only total read counts or allele-specific expression, we simulated haplotypic reads according to the framework illustrated in Fig. 1, with additional details in Methods and Supplementary Notes 6. For mixQTL, we simulated data with a single causal variant and for mixPred and mixFine, we simulated data with 1–3 causal variants.

**Fig. 1: Simulation scheme for total and allele-specific read counts.**

For all simulation settings, we set an average library size of 94 million reads (to approximately match GTEx v8 library sizes) and used a series of expression levels (expected value of θ_0,i in Eq. (1)): from 50 to 1 read per million, corresponding to θ = 5 × 10⁻⁵–10⁻⁶. The fraction of allele-specific reads was kept at consistent levels across simulations by using the same distribution of polymorphic sites per individual.
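To give a sense of scale, the short calculation below converts the two ends of the simulated expression range into expected haplotypic counts under Eq. (1) with β = 0 and no noise. It is an illustrative back-of-the-envelope check using only the library size and θ values stated above; the function name is hypothetical.

```python
LIBRARY_SIZE = 94e6  # average library size stated in the text

def expected_haplotypic_count(theta, library_size=LIBRARY_SIZE):
    """Expected count L * theta from Eq. (1) with beta = 0 and no noise."""
    return library_size * theta

theta_high = 50 / 1e6   # 50 reads per million, theta = 5e-5
theta_low = 1 / 1e6     # 1 read per million, theta = 1e-6

count_high = expected_haplotypic_count(theta_high)  # about 4700 reads
count_low = expected_haplotypic_count(theta_low)    # about 94 reads
```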

Combining total and allele-specific read counts improves cis-eQTL mapping

To assess the gain in power of combining total and allele-specific read counts, we simulated 200 replicates with allelic fold change varying among 1, 1.01, 1.05, 1.1, 1.25, 1.5, 2, 3. We compared mixQTL with two methods: using either only allele-specific counts (ascQTL) or total counts (trcQTL). See details in Supplementary Notes 4.1.

All three methods had calibrated type I errors (Fig. 2a and Supplementary Fig. 1). mixQTL outperformed both trcQTL and ascQTL in all simulation settings, demonstrating the benefits of combining total and allele-specific counts for cis-eQTL mapping (Fig. 2b and Supplementary Fig. 2).

**Fig. 2: QTL mapping performance for mixQTL and approaches based on either total reads (trcQTL) or allele-specific reads (ascQTL) on simulated data.**

The power of ascQTL was sensitive to the number of allele-specific reads, as expected. As shown in Fig. 2b, with θ controlling the expression level, ascQTL yielded much higher power for higher expression levels. In contrast, trcQTL was less sensitive to the number of reads observed under the range of read counts in our simulation settings. Such sensitivity differences between ascQTL and trcQTL are consistent with the nature of count data, where the magnitude of the noise is inversely related to the count.

Combining total and allele-specific read count improves fine-mapping

To realistically simulate LD structure, we used the genotypes of European individuals from the 1000 Genomes Project phase 3²¹ within ±1 Mb cis-windows of 100 randomly selected genes. We applied mixFine and trcFine (which uses total read counts only; Supplementary Notes 5.3) to the simulated data and characterized the fine-mapping results with two metrics: (1) power curve, defined as the proportion of detected variants among causal ones versus the number of detected variants, where detection was defined as the variant having posterior inclusion probability (PIP) > threshold (which is varied to get the desired number of detected SNPs); (2) the size of the 95% credible set (CS), which contains the causal variant. The PIPs of both trcFine and mixFine were consistent with the proportion of true causal variants within each PIP bin (Fig. 3a). By combining total and allele-specific reads, mixFine achieved higher power than trcFine (Fig. 3b and Supplementary Fig. 4) across almost all simulation settings. mixFine achieved the highest improvement relative to trcFine at a high expression level (θ), corresponding to high-quality allele-specific signals. The gain in power decreased with larger sample sizes. The increased power was also reflected in the number and size of 95% CSs containing the true signals. As shown in Fig. 3c and Supplementary Fig. 5, mixFine identified more true positive 95% CSs, and these 95% CSs were generally smaller than those of trcFine (paired t-test p = 5.88 × 10⁻²⁹), demonstrating that mixFine can pinpoint causal SNPs more accurately.
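The power curve defined above can be sketched as follows: variants are ranked by PIP, and lowering the threshold one variant at a time yields the proportion of causal variants detected as a function of the number of detected variants. The function name `power_curve` and the toy PIPs are hypothetical, and ties in PIP are broken arbitrarily in this sketch.

```python
def power_curve(pips, is_causal):
    """Return (number detected, proportion of causal variants detected) pairs.

    Lowering the PIP threshold one variant at a time is equivalent to
    walking down the list of variants sorted by decreasing PIP.
    """
    n_causal = sum(1 for c in is_causal if c)
    order = sorted(range(len(pips)), key=lambda j: pips[j], reverse=True)
    curve, hits = [], 0
    for n_detected, j in enumerate(order, start=1):
        if is_causal[j]:
            hits += 1
        curve.append((n_detected, hits / n_causal))
    return curve

# Toy example with four variants, two of which are causal.
curve = power_curve(pips=[0.9, 0.1, 0.8, 0.3],
                    is_causal=[True, False, False, True])
```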

**Fig. 3: Fine-mapping performance of the combined (mixFine) and total read-based (trcFine) approaches on simulated data.**

Overall, the combined method was more powerful for identifying causal variants, which is consistent with recent reports^10,11.

Combining total and allele-specific read count improves prediction

Using the data from the fine-mapping simulation, we tested the performance of mixPred and trcPred (Supplementary Notes 5.3) on held-out test data. Specifically, we split each simulation replicate into training (4/5) and test (1/5) sets. We trained prediction models using training data and evaluated the prediction performance on test data using Pearson correlation between predicted and true responses. For each dataset, we repeated the splitting-training-evaluation procedure twice to reduce the stochasticity introduced by splitting.

Overall, mixPred achieved higher prediction accuracy than trcPred (Fig. 4 and Supplementary Figs. 6 and 7). The gain in performance was more apparent when the expression level θ was higher and as a consequence the allele-specific count was larger.

**Fig. 4: Prediction performance of the combined (mixPred) and total read-based (trcPred) methods on simulated data.**

mixQTL outperforms standard eQTL mapping in GTEx data

Next, we compared mixQTL to the standard eQTL mapping approach (denoted here simply as eQTL) used by the GTEx consortium¹³, using 670 whole-blood RNA-seq samples from the v8 release (see Methods). We included variants within a ±1 Mb cis-window around the transcription start site of each gene. Although mixQTL can be applied to all genes regardless of the number of allele-specific counts, we focus on examining the benefit of integrating allele-specific information and therefore limit these comparisons to genes with sufficient allele-specific counts, based on the following criteria: (1) at least 15 samples having at least 50 allele-specific counts for each haplotype; and (2) at least 500 samples having a total read count of at least 100. Five thousand seven hundred and thirty four (28%) genes passed these filters. We then stratified these genes by their median expression level (read counts) into low, medium, and high expression tertiles. For genes with below-threshold allele-specific counts, the calculation can be performed using total read counts only, such that all genes considered using the standard approach are also tested in mixQTL. Performance for these genes was similar to the standard eQTL approach (Supplementary Fig. 8).

All three approaches (mixQTL, ascQTL, and trcQTL) were relatively well-calibrated when permuting data in four randomly selected genes (Supplementary Fig. 9). The estimated effect sizes were consistent with allelic fold change estimates from the main GTEx v8 analysis (Supplementary Fig. 10).

To further compare the performance of the methods, we used eQTLGen¹⁴, a large-scale meta-analysis of over 30,000 blood samples, as our “ground truth” eQTL discovery reference (Supplementary Notes 8). We selected a random subset of 100,000 variant/gene pairs tested by eQTLGen with FDR < 0.05 as the set of “ground truth” eQTLs. We also selected a random set of 100,000 variant/gene pairs with p > 0.50 as a background set of “non-significant” eQTLs. Among these pairs, 96,660 and 78,691 of the “ground truth” and “non-significant” pairs had matching data in GTEx.

For the “ground truth” eQTLs, mixQTL yielded more significant p-values compared to the standard eQTL, ascQTL, and trcQTL approaches (Fig. 5). The “non-significant” variant/gene pairs showed moderate enrichment for small p-values for all methods (Fig. 5b), likely reflecting a combination of false negatives in eQTLGen and potential false positives in our analysis. Overall, we found that mixQTL achieves increased power compared to standard eQTL mapping on real data for the set of genes with sufficient total and allele-specific read counts.

**Fig. 5: Performance of mixQTL on GTEx v8 whole-blood RNA-seq.**

As an intuitive measure of improved performance, we estimated the effective sample size gain of mixQTL compared to standard eQTL mapping as the median of the ratio between mixQTL χ² statistics and eQTL χ² statistics. mixQTL showed a 29% increase in effective sample size compared to the standard eQTL mapping approach (Fig. 5c).
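The rationale is that, for a fixed effect size, the expected χ² statistic of an association test grows approximately linearly with sample size, so a ratio of χ² statistics can be read as a ratio of effective sample sizes. A minimal sketch of the summary computation is shown below; the function name and the χ² values are made up for illustration and are not GTEx results.

```python
import statistics

def effective_sample_size_gain(chi2_new, chi2_ref):
    """Median chi-square ratio minus one, read as a relative sample size gain."""
    ratios = [a / b for a, b in zip(chi2_new, chi2_ref)]
    return statistics.median(ratios) - 1.0

# Hypothetical chi-square statistics for five variant/gene pairs.
chi2_mix = [12.0, 26.0, 5.0, 33.0, 8.0]
chi2_std = [10.0, 20.0, 5.0, 22.0, 8.0]

gain = effective_sample_size_gain(chi2_mix, chi2_std)  # median ratio is 1.2
```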

To account for the trade-off between true- and false-positive rates, as well as between precision and power, we used receiver operating characteristic (ROC) and precision-recall (PR) curves to compare the performance of mixQTL and standard eQTL approaches using the eQTLGen “ground truth” and “non-significant” eQTLs. We found that mixQTL achieves higher performance in both ROC (Fig. 5d) and PR curves (Fig. 5e). Consistent with simulation results, this gain is more pronounced for genes with higher expression levels.

To determine whether the eQTLGen-based analysis above depended on the selected random subset of cis-eQTLs, we repeated the analysis for multiple samplings of eQTLGen results and found no substantive differences in the results.

mixQTL is scalable to full GTEx eQTL analysis

To compare the performance and computational cost of mixQTL and the existing QTL mapping approaches which can leverage both total and allele-specific counts, we ran RASQUAL on two of the GTEx tissues, kidney cortex (sample size = 73; a subset of 4596 genes) and whole blood (a subset of 192 genes; Supplementary Notes 9). We observed concordant effect size estimates (Supplementary Fig. 11A). As expected, because RASQUAL models counts directly instead of approximating them with a log-linear model, it yielded more significant results than mixQTL (Supplementary Fig. 11B). On average, RASQUAL took 47 seconds per gene in kidney cortex and 826 seconds per gene in whole blood whereas mixQTL took 0.065 seconds (723 times faster) and 0.33 seconds (2480 times faster), respectively.

Given this computational efficiency, we decided to run mixQTL on the 49 tissues from the GTEx v8 release. This corresponded to 15,201 samples in total, and took ~54 CPU hours in total (without permutations).

mixQTL’s runtime scaled linearly as a function of sample size (Supplementary Fig. 12A), with the tissue with the largest sample size (skeletal muscle, n = 706) taking 0.34 seconds per gene on average.

At FDR cutoff 0.05, on average, mixQTL identified 1440 more genes and about 618,000 more eQTLs than the standard eQTL approach (Supplementary Fig. 12B and C). The full summary statistics of mixQTL are publicly available (Supplementary Data 1).

Fine-mapping and prediction model building in GTEx data

We applied mixFine to the GTEx v8 whole-blood RNA-seq data, using the same subset of genes with high expression and allelic counts that were used in the comparison of mixQTL vs. standard eQTL approach above. We compared mixFine to the SuSiE fine-mapping approach¹⁹, applied to inverse normal transformed expression values in the standard eQTL mapping pipeline¹³. We corrected for sex, five genetic principal components, WGS platform, WGS library prep protocol (PCR), and 60 PEER factors. We refer to the latter as the “standard approach” below for simplicity.

To compare the power of causal variant detection, we performed a subsampling analysis on a random subset of 1000 genes. First, we defined “consensus SNPs” as the variants with PIP > 0.5 in both mixFine and the “standard approach” using all samples. Similarly, a variant was defined as “top SNP” if it was the most significant variant within the 95% CS for both mixFine and the “standard approach”. Then, we compared how well the “consensus SNPs” and “top SNPs” were detected by mixFine and the standard fine-mapping approach using only a subset of samples. We subsampled to 90%, 80%, ⋯ , 30% of samples, and repeated each random subsampling step 10 times.

Among the 1000 genes, 272 “consensus SNPs” were identified in the full data. At each subsampling level, mixFine, on average, detected more “consensus SNPs” than the standard approach (Fig. 6a), and performance improved most on the more highly expressed genes (top tertile) (Supplementary Fig. 13). Moreover, mixFine detected “top SNPs” in 95% CSs with an average size of 9.5 variants, whereas the corresponding 95% CS from the standard approach had 14.6 variants on average (Supplementary Fig. 14). Furthermore, since the power gain would be more apparent in small sample sizes, we ran mixFine and the standard eQTL approach in 26 GTEx v8 tissues with sample size <260. We examined the enrichment of the top QTL and PIP in different functional annotations, including regulatory element annotations, candidate cis-regulatory elements (cCREs)²², and the GWAS catalog (Supplementary Notes 10). We found that the variants with the most significant mixQTL p-value or the highest mixFine PIP were more enriched in GWAS catalog variants and cCREs than those from the standard approach. We found enrichment in enhancers, promoters, and transcription factor binding sites, but the difference in enrichment between mixQTL and standard QTL methods was not significant (Supplementary Fig. 16). The reduced enrichment compared to cCREs is likely due to the fact that we used tissue-specific annotations for cCREs and cross-tissue annotations for enhancers, promoters, and TFs. These results indicate that, when sufficient counts are available, mixFine, the multi-SNP model combining total and allele-specific counts, can better pinpoint causal cis-eQTLs than the standard approach on real data.

**Fig. 6: Performance of mixFine and mixPred on GTEx v8 whole-blood RNA-seq.**

To compare the performance of mixPred and the standard method on real data, we implemented a cross-validated evaluation pipeline where we split the GTEx v8 whole-blood data into k folds. At each fold, we trained the prediction model using one fold of the data and evaluated the performance (by Pearson correlation between predicted and observed ${\mathrm{log}}\,({Y}_{i}^{\,\text{total}\,}/{L}_{i})$) on the remaining (k − 1) folds. We applied this evaluation pipeline to mixPred and the standard approach (elastic net as in ref. ¹) on the same 1000 genes as the subsampling analysis with k = 10, 9, …, 2 (corresponding to sample size = 67, 75, ⋯ , 335). At the same sample size, we observed, on average, significantly higher performance in mixPred as compared to the standard approach, and the performance gain was greater for smaller sample sizes (Fig. 6b and Supplementary Table 1).
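Note that this scheme inverts the usual cross-validation roles: one fold is used for training and the remaining k − 1 folds for evaluation, so larger k corresponds to a smaller training set. The sketch below generates such splits; the function name, the shuffling, and the way samples are assigned to folds are assumptions for illustration rather than the authors' pipeline.

```python
import random

def one_fold_training_splits(n_samples, k, seed=0):
    """Split indices into k folds; train on one fold, test on the other k - 1."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]
    splits = []
    for f in range(k):
        train = folds[f]
        test = [i for g in range(k) if g != f for i in folds[g]]
        splits.append((train, test))
    return splits

# With 670 samples, k = 10 gives training sets of 67 and k = 2 gives 335.
splits_k10 = one_fold_training_splits(670, 10)
splits_k2 = one_fold_training_splits(670, 2)
```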

Discussion

We proposed a unified framework that integrates both allele-specific and total read counts to estimate genetic cis-regulatory effects, resulting in improved eQTL mapping, fine-mapping, and prediction of gene expression traits. Our suite of tools (mixQTL, mixFine, and mixPred) can be scaled to much larger sample sizes (thousands) due to the underlying log-linear approximation. By assuming weak multiplicative genetic effects, consistent with observations (estimated log allelic fold changes of cis-eQTLs have a median absolute value of 0.153 and a 95th percentile of 0.845; Supplementary Fig. 15), we transform the observed read counts into two approximately independent quantities: allelic imbalance and total read count. Leveraging this independence, we developed computationally efficient approaches that integrate both allele-specific and total reads.

Specifically, mixQTL estimates the genetic effect separately for allelic imbalance and total read counts, and combines the resulting statistics via meta-analysis. These calculations have computationally efficient closed-form solutions, enabling their use in permutation schemes applied to compute FDR in eQTL mapping^23,24,25.

Furthermore, the simple multi-SNP extension and the approximate independence of the terms enable use of a two-step inference procedure. In the first step, the allelic imbalance and total read count are scaled such that the error terms have the same variance. In the second step, given their approximate independence, the pair of equations (from allelic imbalance and total counts) can simply be input into existing fine-mapping and prediction algorithms.

We showed through simulations and applications to GTEx v8 data that our suite of methods outperforms methods that rely on total read counts alone. Compared to existing QTL mapping methods that integrate total and allele-specific reads, such as RASQUAL⁹, mixQTL has slightly lower power (Supplementary Fig. 11B). This is expected since RASQUAL models count data directly and mixQTL relies on approximations. However, the computational burden of RASQUAL is prohibitive for large datasets. In practice, the most suitable approach will depend on computational capacity and sample sizes. For datasets with small sample sizes (e.g., fewer than 100 samples), RASQUAL or WASP remain preferable. The computational efficiency of mixQTL makes it applicable to large sample sizes, and, moreover, enables using the mixQTL model in place of the standard eQTL mapping approach that relies on inverse normal transformed counts.

Given the unified modeling framework and computationally scalable tools proposed here, we anticipate that combining total and allele-specific read counts will find widespread use for cis-QTL mapping, fine-mapping, and prediction of gene expression.

Methods

Notation and terminology

The notation and terminology used in this paper are summarized in Table 1.

Table 1 Summary of notation and terminology used in the paper.

Statistical model of cis-regulation

For individual i, let ${X}_{i}^{1}$ and ${X}_{i}^{2}$ be the number of alternative alleles in each of the two haplotypes at the variant of interest. Let ${Y}_{i}^{1}$ and ${Y}_{i}^{2}$ be the number of reads coming from each of the two haplotypes (i.e., haplotypic counts; in practice, these quantities are unobserved) and L_i the library size for individual i. As proposed in ref. ¹², we use the concept of allelic fold change (aFC) to represent the genetic effect on cis-expression. We denote θ_0,i as the baseline abundance of the transcripts originating from each haplotype of the gene without considering the genetic effect. Let β be the genetic effect of a variant of interest, which is defined as the log fold change relative to the reference allele. Then, the transcript abundance of each haplotype h after accounting for the genetic effect is ${\theta }_{i}^{h}={\theta }_{0,i}\times g(\beta ,{X}_{i}^{h})$ where $g(\beta ,{X}_{i}^{h})$ is e^β if haplotype h carries the alternative allele (${X}_{i}^{h}=1$); otherwise $g(\beta ,{X}_{i}^{h})=1$. We model read count ${Y}_{i}^{h}$ as

$${\mathrm{log}}\,{Y}_{i}^{h}| {L}_{i},{\theta }_{i}^{h} \sim N({\mathrm{log}}\,({L}_{i}{\theta }_{i}^{h}),{\tau }_{i}^{h}).$$

(5)

In an RNA-seq experiment, a fraction of reads contribute to allele-specific read counts. Let α_i denote the fraction of allele-specific reads in individual i, which depends on the number of heterozygous sites within the transcript. Instead of observing haplotypic counts ${Y}_{i}^{1}$ and ${Y}_{i}^{2}$, we observe total read count ${Y}_{i}^{\,\text{total}\,}$ and gene-level allele-specific read counts ${Y}_{i}^{(1)\,\text{obs}\,}$ and ${Y}_{i}^{(2)\,\text{obs}\,}$. Similarly, we further assume that the baseline abundance of allele-specific reads per haplotype is θ_0,i × α_i, so we have

$${\mathrm{log}}\,{Y}_{i}^{(1)\,{\text{obs}}\,}| {L}_{i},{\theta }_{i}^{1},{\alpha }_{i}\ \sim N({\mathrm{log}}\,({\alpha }_{i}{L}_{i}{\theta }_{i}^{1}),{\tau }_{i}^{(1)})$$

(6)

$$\begin{array}{*{20}{l}}{\mathrm{log}}\,{Y}_{i}^{(2)\,{\text{obs}}\,}| {L}_{i},{\theta }_{i}^{2},{\alpha }_{i}\hfill & \sim \hfill & N({\mathrm{log}}\,({\alpha }_{i}{L}_{i}{\theta }_{i}^{2}),{\tau }_{i}^{(2)})\hfill\\ {\mathrm{log}}\,{Y}_{i}^{\,\text{total}\,}| {L}_{i},{\theta }_{i}^{1},{\theta }_{i}^{2}\hfill & = \hfill& {\mathrm{log}}\,({Y}_{i}^{1}+{Y}_{i}^{2})| {L}_{i},{\theta }_{i}^{1},{\theta }_{i}^{2}\hfill\end{array}$$

(7)

$$\sim N({\mathrm{log}}\,[{L}_{i}({\theta }_{i}^{1}+{\theta }_{i}^{2})],{\tau }_{i})$$

(8)

Linearizing the model by approximation

Based on the model described above along with approximations under weak effect assumptions, we propose the following linear mixed effects model (see Supplementary Notes 2 for derivation):

$$\underbrace{{\mathrm{log}}\,\frac{{Y}_{i}^{\text{total}}}{2{L}_{i}}}_{{Y}_{i}^{\text{trc}}}={\mu }_{0}+{z}_{i}+\underbrace{\frac{{X}_{i}^{1}+{X}_{i}^{2}}{2}}_{{X}_{i}^{\text{trc}}}\beta +{\epsilon }_{i}^{\text{trc}}$$

(9)

$$\underbrace{{\mathrm{log}}\,\frac{{Y}_{i}^{(1)\text{obs}}}{{Y}_{i}^{(2)\text{obs}}}}_{{Y}_{i}^{\text{asc}}}=\underbrace{({X}_{i}^{1}-{X}_{i}^{2})}_{{X}_{i}^{\text{asc}}}\beta +{\epsilon }_{i}^{\text{asc}}$$

(10)

$${z}_{i} \sim N(0,{\sigma }_{0}^{2}),\quad {\epsilon }_{i}^{\,\text{trc}\,} \sim N\left(0,\frac{{\sigma }^{2}}{{Y}_{i}^{\,\text{total}\,}}\right),\quad {\epsilon }_{i}^{\,\text{asc}\,} \sim N\left(0,\frac{{\sigma }^{2}}{{w}_{i}}\right),\quad {w}_{i}=\frac{{Y}_{i}^{(1)\,\text{obs}\,}{Y}_{i}^{(2)\,\text{obs}\,}}{{Y}_{i}^{(1)\,\text{obs}\,}+{Y}_{i}^{(2)\,\text{obs}\,}},$$

(11)

where z_i is the individual-level random effect capturing the between-individual variation of θ_0,i. Notice that the individual-level random effect cancels out when we take the difference between the two log-scale allele-specific read counts (allelic imbalance in log scale). The scaling of ϵ^trc and ϵ^asc in Eq. (11) ensures that the variance of the read count scales linearly with the magnitude of the read count (see Supplementary Notes 1.2). In other words, this model ensures Var(Y) ≈ constant × E(Y), such that over-dispersion is implicitly taken into account.
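A heuristic first-order check of this scaling (a sketch under simplifying assumptions, not the derivation given in Supplementary Notes 1.2) goes as follows. Write $m=E(Y)$ and suppose ${\mathrm{log}}\,Y={\mathrm{log}}\,m+\epsilon$ with $\mathrm{Var}(\epsilon )\approx {\sigma }^{2}/m$, that is, the count in the denominator of the error variance is replaced by its mean. For small $\epsilon$,

$$Y=m\,{e}^{\epsilon }\approx m(1+\epsilon ),\qquad \mathrm{Var}(Y)\approx {m}^{2}\,\mathrm{Var}(\epsilon )\approx {m}^{2}\cdot \frac{{\sigma }^{2}}{m}={\sigma }^{2}\,m={\sigma }^{2}\,E(Y).$$

Under this approximation, σ² plays the role of the constant in Var(Y) ≈ constant × E(Y); a value above 1 corresponds to variance in excess of a Poisson model with the same mean.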

Since ${\epsilon }_{i}^{\,\text{asc}\,}$ is approximately independent of ${\epsilon }_{i}^{\,\text{trc}\,}$ (see Supplementary Notes 4), ${\epsilon }_{i}^{\,\text{trc}\,}$ and z_i can be merged into one term ${\widetilde{z}}_{i}$. So, we can further simplify Eqs. (9), (10) as

$${Y}_{i}^{\,\text{trc}}={\mu }_{0}+{X}_{i}^{\text{trc}}{\beta }^{\text{trc}}+{\widetilde{z}}_{i},{\widetilde{z}}_{i} \sim N(0,{\widetilde{\sigma }}_{0}^{2})$$

(12)

$${Y}_{i}^{\,\text{asc}}={X}_{i}^{\text{asc}}{\beta }^{\text{asc}}+{\epsilon }_{i}^{\text{asc}},\quad {\epsilon }_{i}^{\text{asc}} \sim N(0,{\sigma }^{2}/{w}_{i})$$

(13)

Equations (12), (13) are applicable to both single SNP and multi-SNP scenarios. In the single-SNP case, X_i and β are scalars, and in the multi-SNP case, X_i and β are replaced by vectors including all SNPs within the cis-window (see Supplementary Notes 3).
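As a minimal sketch of how the response and design variables in Eqs. (9)–(13) can be assembled from raw counts, the Python snippet below computes Y^trc, X^trc, Y^asc, X^asc, and w_i for a single SNP. The function name, argument names, and toy numbers are illustrative assumptions and are not taken from the mixQTL software; w_i is assumed to be (1/Y^(1)obs + 1/Y^(2)obs)^(-1), the weight attached to the allele-specific term.

```python
import numpy as np

def build_mixqtl_variables(y_total, y1_obs, y2_obs, lib_size, x1, x2):
    """Assemble single-SNP variables for the linearized model (hypothetical helper)."""
    y_total, y1_obs, y2_obs, lib_size, x1, x2 = (
        np.asarray(a, dtype=float)
        for a in (y_total, y1_obs, y2_obs, lib_size, x1, x2)
    )
    y_trc = np.log(y_total / (2.0 * lib_size))  # left-hand side of Eq. (9)
    x_trc = (x1 + x2) / 2.0                     # average haplotype dosage
    y_asc = np.log(y1_obs / y2_obs)             # left-hand side of Eq. (10)
    x_asc = x1 - x2                             # haplotype dosage difference
    w = 1.0 / (1.0 / y1_obs + 1.0 / y2_obs)     # assumed weight w_i
    return y_trc, x_trc, y_asc, x_asc, w

# Toy data for two individuals (made-up numbers, for illustration only).
y_trc, x_trc, y_asc, x_asc, w = build_mixqtl_variables(
    y_total=[400, 900], y1_obs=[20, 30], y2_obs=[20, 10],
    lib_size=[1e6, 1e6], x1=[0, 1], x2=[0, 0],
)
```

Note that an individual who is homozygous at the SNP has X^asc = 0 and so contributes no information on β through Eq. (13).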

Numerically efficient QTL mapping leveraging approximate independence of allelic imbalance and total read count

The likelihood function corresponding to the proposed model in Eqs. (12), (13) approximately takes the form

$$\mathop{\prod}\limits_{i}\Pr(Y_{i}^{{\rm{total}}}|\mu_{0},\widetilde{\sigma}^{2}_{0},\beta) \cdot \Pr \left(\frac{Y_{i}^{(1){\rm{obs}}}}{Y_{i}^{(2){\rm{obs}}}}|\sigma^{2},\beta\right),$$

which factors into total read count and allelic imbalance components (see Supplementary Notes 2.2). This means that the likelihoods for the total read count and for the ratio of allele-specific read counts provide approximately independent information on β, which enables us to solve each component separately and combine the results via meta-analysis (the standard approach for independent studies²⁶). Specifically, we fit β^trc and β^asc using total and allele-specific observations as two separate linear regression problems, and meta-analyze the results using inverse-variance weighting (see details in Supplementary Notes 4.2).
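The two-regression-plus-meta-analysis idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the mixQTL implementation: the total-count model is fit by ordinary least squares with an intercept, the allele-specific model by weighted least squares through the origin with weights w_i, and the two estimates are combined by fixed-effect inverse-variance weighting. All function names and the simulated data are hypothetical.

```python
import numpy as np

def fit_trc(y_trc, x_trc):
    """OLS with intercept for Eq. (12); returns (beta_hat, standard error)."""
    design = np.column_stack([np.ones_like(x_trc), x_trc])
    coef, *_ = np.linalg.lstsq(design, y_trc, rcond=None)
    resid = y_trc - design @ coef
    sigma2 = resid @ resid / (len(y_trc) - 2)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef[1], np.sqrt(cov[1, 1])

def fit_asc(y_asc, x_asc, w):
    """WLS through the origin for Eq. (13) with Var(eps_i) = sigma^2 / w_i."""
    sxx = np.sum(w * x_asc**2)
    beta = np.sum(w * x_asc * y_asc) / sxx
    resid = y_asc - x_asc * beta
    sigma2 = np.sum(w * resid**2) / (len(y_asc) - 1)
    return beta, np.sqrt(sigma2 / sxx)

def meta_ivw(betas, ses):
    """Fixed-effect inverse-variance-weighted combination of estimates."""
    betas, ses = np.asarray(betas, float), np.asarray(ses, float)
    precision = 1.0 / ses**2
    beta = np.sum(precision * betas) / np.sum(precision)
    return beta, np.sqrt(1.0 / np.sum(precision))

# Simulated toy data (all settings below are arbitrary illustration values).
rng = np.random.default_rng(0)
n, beta_true = 500, 0.3
x1 = rng.binomial(1, 0.3, n).astype(float)
x2 = rng.binomial(1, 0.3, n).astype(float)
x_trc, x_asc = (x1 + x2) / 2.0, x1 - x2
w = rng.uniform(5, 50, n)
y_trc = -10.0 + beta_true * x_trc + rng.normal(0, 0.2, n)
y_asc = beta_true * x_asc + rng.normal(0, 1.0, n) / np.sqrt(w)

b_trc, se_trc = fit_trc(y_trc, x_trc)
b_asc, se_asc = fit_asc(y_asc, x_asc, w)
b_meta, se_meta = meta_ivw([b_trc, b_asc], [se_trc, se_asc])
```

Because the combined precision is the sum of the two precisions, the meta-analyzed standard error is never larger than either component's, which is where the power gain from adding allele-specific counts comes from.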

Two-step inference procedure for multi-SNP model

The prediction and fine-mapping problems both rely on the linearized model in Eqs. (12), (13), but with different objectives. For prediction, the objective is to find the best predictor, whereas for fine-mapping, the objective is to infer whether β_k is non-zero. Existing solvers for both prediction and fine-mapping use total read information only and assume that the data (X, y) follow the model y = Xβ + ϵ, where the noise term ϵ is independent across the rows of the data matrix. We refer to this model as the ‘canonical’ linear model. We propose a two-step inference procedure that first processes the data so that they approximately follow y = Xβ + ϵ, and then uses existing solvers for the prediction and fine-mapping problems, respectively.

For the first step, we process total and allele-specific reads separately to fit the ‘canonical’ linear model. Specifically, we estimate σ² from (Y^asc, X^asc) based on Eq. (13) by further treating the genetic effect as a random effect and fitting the resulting model with the R package EMMA²⁷. Similarly, based on Eq. (12) and the same random effect assumption, we estimate ${\widetilde{\sigma }}_{0}^{2}$ from (Y^trc, X^trc). To account for the intercept term μ₀ in Eq. (12), we center Y^trc and X^trc by subtracting their mean values across all samples and then scale the centered (Y^trc, X^trc) by $1/{\widehat{\widetilde{\sigma }}}_{0}$. Likewise, we scale (Y^asc, X^asc) by $w/\hat{\sigma }$. These linear transformations ensure that the transformed $({\tilde{Y}}^{\text{trc}},{\tilde{X}}^{\text{trc}})$ and $({\tilde{Y}}^{\text{asc}},{\tilde{X}}^{\text{asc}})$ both approximately follow Y = Xβ + ϵ. The implementation details are described in Supplementary Notes 5. In the second step, we concatenate the transformed data from both total and allele-specific read counts as $(\tilde{Y},\tilde{X})$, which is compatible with existing solvers for prediction and fine-mapping problems.
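The data-processing step can be illustrated with the following Python sketch. It is not the authors' implementation: the variance estimates are taken as given inputs (the EMMA fit is not reproduced), and all names are hypothetical. The text gives the allele-specific scaling factor as w/σ̂; this sketch instead assumes the factor sqrt(w_i)/σ̂, which is the one that yields unit noise variance if Var(ϵ_i^asc) = σ²/w_i as in Eq. (13). Treat that choice as an assumption and consult Supplementary Notes 5 for the exact definition.

```python
import numpy as np

def stack_for_canonical_model(y_trc, X_trc, y_asc, X_asc, w,
                              sigma0_tilde_hat, sigma_hat):
    """Center/scale both data sources and stack them (hypothetical helper)."""
    # Total-count part: centering absorbs the intercept mu_0 of Eq. (12).
    y_t = (y_trc - y_trc.mean()) / sigma0_tilde_hat
    X_t = (X_trc - X_trc.mean(axis=0)) / sigma0_tilde_hat
    # Allele-specific part: ASSUMED whitening factor sqrt(w_i) / sigma_hat.
    s = np.sqrt(w) / sigma_hat
    y_a = y_asc * s
    X_a = X_asc * s[:, None]
    return np.concatenate([y_t, y_a]), np.vstack([X_t, X_a])

# Noise-free toy check with 3 SNPs (arbitrary illustration values).
rng = np.random.default_rng(1)
n, b_true = 50, np.array([0.5, -0.2, 0.0])
H1 = rng.binomial(1, 0.3, (n, 3)).astype(float)
H2 = rng.binomial(1, 0.3, (n, 3)).astype(float)
X_trc, X_asc = (H1 + H2) / 2.0, H1 - H2
y_trc = 2.0 + X_trc @ b_true
y_asc = X_asc @ b_true
w = rng.uniform(5, 50, n)

y_stacked, X_stacked = stack_for_canonical_model(
    y_trc, X_trc, y_asc, X_asc, w, sigma0_tilde_hat=0.4, sigma_hat=1.2)
# Any solver for y = X beta + eps can now be applied; plain least squares here.
b_hat = np.linalg.lstsq(X_stacked, y_stacked, rcond=None)[0]
```

In practice the stacked pair would be handed to a fine-mapping or penalized-regression solver rather than to plain least squares; the least-squares call is only there to show that the stacked system shares a single β.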

Adjusting for covariates

When analyzing real data, we need to take covariates such as sex, batch effects, and population stratification into account. Here, we adapt a previously proposed procedure¹². We regress out the effect of covariates beforehand and use the residuals as the response in both QTL mapping and multi-SNP model fitting. Specifically, let c₁, ⋯ , c_K denote the K covariates to be considered. We first regress Y^trc against c₁, ⋯ , c_K jointly and select the covariates with nominally significant coefficients (p < 0.05). We then regress Y^trc against the selected covariates jointly and use the residuals as the adjusted Y^trc for downstream QTL mapping and multi-SNP inference.
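The select-then-residualize procedure can be sketched as follows. This is a hypothetical illustration rather than the published code: p-values use a normal approximation to the t distribution (reasonable only for large samples) so that the example needs nothing beyond NumPy, and the fallback of simply centering Y^trc when no covariate is selected is an assumption, since the text does not cover that case.

```python
import math
import numpy as np

def ols_with_pvalues(y, C):
    """Joint OLS of y on an intercept plus columns of C; two-sided p-values
    from a normal approximation (an assumption of this sketch)."""
    n, k = C.shape
    design = np.column_stack([np.ones(n), C])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - k - 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    pvals = np.array([math.erfc(abs(z) / math.sqrt(2.0)) for z in coef / se])
    return coef, pvals, resid

def adjust_for_covariates(y_trc, C, alpha=0.05):
    """Select nominally significant covariates, then return residuals."""
    _, pvals, _ = ols_with_pvalues(y_trc, C)
    selected = np.where(pvals[1:] < alpha)[0]   # skip the intercept
    if selected.size == 0:
        # ASSUMPTION: with nothing selected, only remove the mean.
        return y_trc - y_trc.mean(), selected
    _, _, resid = ols_with_pvalues(y_trc, C[:, selected])
    return resid, selected

# Toy data: only the first of three covariates truly affects the response.
rng = np.random.default_rng(2)
n = 200
C = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * C[:, 0] + rng.normal(0, 0.5, n)
y_adj, selected = adjust_for_covariates(y, C)
```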

Simulation scheme

We simulate RNA-seq reads with total and allele-specific readouts in three steps, as sketched in Fig. 1. In step 1, we specify, for each individual i, the positions of heterozygous sites within the gene body. The expected read count from the transcripts of each haplotype, $\text{E}({Y}_{i}^{h})$, is determined by the RNA-seq library size L_i, the baseline abundance of the transcript θ_0,i, and the genetic effect β. In step 2, given the expected haplotypic count, we draw ${Y}_{i}^{h}$ from a negative binomial distribution to model the variation among count data. In step 3, we position the reads randomly along the gene body and read out the observed allele-specific count ${Y}_{i}^{(h)\,\text{obs}}$ by counting the number of reads overlapping the heterozygous sites simulated in step 1. The total read count readout is ${Y}_{i}={Y}_{i}^{1}+{Y}_{i}^{2}$, which is independent of the number of heterozygous sites.

To survey a wide range of parameters, we simulate data over a grid of parameter values. We vary the sample size among 100, 200, ..., 500. At a library size of around 90 million, we vary θ_0,i among 5 × 10⁻⁵, 2.5 × 10⁻⁵, 1 × 10⁻⁵, 5 × 10⁻⁶, 2.5 × 10⁻⁶, 1 × 10⁻⁶ to cover genes with different expression levels. The genetic effect, aFC, is set to 1 (null), 1.01, 1.05, 1.1, 1.25, 1.5, 2, 3 in the single-SNP model. For the multi-SNP scenario, we set the number of causal SNPs between 1 and 3 with heritability from 0.2 to 0.55. The number of polymorphic sites within the gene body is centered around 10 with minor allele frequency from 0.05 to 0.3. A detailed description and parameter settings are provided in Supplementary Notes 6.
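The three simulation steps can be mimicked with a deliberately simplified Python sketch. Several choices here are assumptions that are not specified in this section: the expected haplotypic count is taken as L_i · θ_0,i · aFC^(X_i^h), the negative binomial is parameterized by its mean and a dispersion parameter `nb_size`, and read placement along the gene body is replaced by binomial thinning with a hypothetical per-read probability `p_het_overlap` of covering a heterozygous site. The actual settings are in Supplementary Notes 6.

```python
import numpy as np

def simulate_counts(x1, x2, lib_size, theta0, afc, nb_size, p_het_overlap, rng):
    """Simulate total and observed allele-specific counts (simplified sketch)."""
    beta = np.log(afc)  # ASSUMPTION: effect acts as a fold change per allele
    # Expected haplotypic counts (assumed functional form).
    mu1 = lib_size * theta0 * np.exp(beta * x1)
    mu2 = lib_size * theta0 * np.exp(beta * x2)
    # Step 2: negative binomial with mean mu; NumPy uses (n, p) with
    # mean n * (1 - p) / p, so p = n / (n + mu).
    y1 = rng.negative_binomial(nb_size, nb_size / (nb_size + mu1))
    y2 = rng.negative_binomial(nb_size, nb_size / (nb_size + mu2))
    # Step 3 (simplified): each read overlaps a heterozygous site with
    # probability p_het_overlap instead of being placed on the gene body.
    y1_obs = rng.binomial(y1, p_het_overlap)
    y2_obs = rng.binomial(y2, p_het_overlap)
    return y1 + y2, y1_obs, y2_obs

rng = np.random.default_rng(3)
n = 2000
x1 = rng.binomial(1, 0.2, n)
x2 = rng.binomial(1, 0.2, n)
y_total, y1_obs, y2_obs = simulate_counts(
    x1, x2, lib_size=9e7, theta0=5e-5, afc=1.0,
    nb_size=20.0, p_het_overlap=0.05, rng=rng)
```

With aFC = 1 (the null setting), the total count has mean 2 · L_i · θ_0,i regardless of genotype, and the observed allele-specific counts are always a subset of the total reads.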

Analysis of GTEx v8 data

We downloaded the phased genotypes, total read count matrix, and variant-level allele-specific read counts for whole blood from GTEx release 8¹³ via dbGaP (accession number phs000424.v8.p2). To obtain gene-level read counts, we summed over allele-specific counts at all the heterozygous sites for each gene haplotype. We also obtained library size, sex, and genotype PCs from GTEx v8. For comparisons with the inverse normalization-based approach, we also downloaded normalized expression matrices.

Similarly to the GTEx v8 analyses¹³, we restricted the analysis to the cis-regulatory window, defined as 1 Mbp upstream and downstream of the transcription start site of each gene.

To obtain the PEER factors for mixQTL analysis, we ran peertool²⁸ on a matrix with value $\mathrm{log}\,(\frac{{Y}_{i,g}}{2{L}_{i}})$ for individual i and gene g (imputed by k-nearest neighbors if Y_i,g is zero using impute::impute.knn in R).

We considered very large allele-specific counts to be likely alignment artifacts and removed individuals with allele-specific read counts greater than 1000. To further limit the influence of large count outliers on the estimated log fold change, ${\hat{\beta }}^{\text{asc}}$, we capped the largest weight ${\left(\frac{1}{{Y}_{\,}^{(1)\text{obs}}}+\frac{1}{{Y}_{\,}^{(2)\text{obs}}}\right)}^{-1}$ at K times the smallest one, where $K=\min (10,\,\text{sample size}\,/10)$.
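A small Python sketch of the outlier filter and weight cap described above is given below. The helper name is hypothetical, and two details are assumptions: an individual is dropped if the count on either haplotype exceeds 1000, and the sample size used in K is passed in explicitly by the caller.

```python
import numpy as np

def capped_asc_weights(y1_obs, y2_obs, sample_size, max_count=1000):
    """Drop likely artifacts, then cap weights at K times the smallest weight."""
    y1_obs = np.asarray(y1_obs, dtype=float)
    y2_obs = np.asarray(y2_obs, dtype=float)
    # ASSUMPTION: exclude an individual if either haplotype count is too large.
    keep = (y1_obs <= max_count) & (y2_obs <= max_count)
    w = 1.0 / (1.0 / y1_obs[keep] + 1.0 / y2_obs[keep])
    k = min(10.0, sample_size / 10.0)
    return np.minimum(w, k * w.min()), keep

# Three individuals; the third has an implausibly large count and is removed.
w_capped, keep = capped_asc_weights(
    y1_obs=[10, 500, 2000], y2_obs=[10, 500, 50], sample_size=200)
```

Here the raw weights of the two retained individuals are 5 and 250; with K = 10 the second weight is capped at 50.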

Specific analyses focused on high or low expression were performed with different gene filtering criteria as stated in the Results section.

For analyses of the full GTEx v8 dataset, we built a data analysis pipeline at https://github.com/liangyy/mixqtl-gtex/tree/master/mixqtl, which relies on the tensorQTL implementation of mixQTL. We included all genes regardless of expression level and analyzed the 22 autosomes for each of the 49 tissues. Since mixQTL can only work with non-zero total read counts, we imputed missing total read counts as 1. In the mixQTL call, all total read counts were included, and allele-specific counts were included only if there were more than 15 reads on both haplotypes.
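The inclusion rules in this paragraph can be written down as a short Python sketch. It is an illustration only and not code from the pipeline: the helper name is hypothetical, and treating both zero and NaN total counts as "missing" is an assumption.

```python
import numpy as np

def prepare_mixqtl_inputs(y_total, y1_obs, y2_obs, min_asc_reads=15):
    """Impute unusable total counts and flag usable allele-specific counts."""
    y_total = np.asarray(y_total, dtype=float)
    y1_obs = np.asarray(y1_obs, dtype=float)
    y2_obs = np.asarray(y2_obs, dtype=float)
    # ASSUMPTION: zero or NaN total counts are the "missing" values set to 1.
    missing = np.isnan(y_total) | (y_total == 0)
    y_total_imputed = np.where(missing, 1.0, y_total)
    # Allele-specific counts are used only with > 15 reads on both haplotypes.
    use_asc = (y1_obs > min_asc_reads) & (y2_obs > min_asc_reads)
    return y_total_imputed, use_asc

y_total_imputed, use_asc = prepare_mixqtl_inputs(
    y_total=[0, np.nan, 120], y1_obs=[16, 40, 15], y2_obs=[20, 10, 30])
```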

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The Genotype-Tissue Expression (GTEx) project’s raw whole-transcriptome and -genome sequencing data are available via dbGaP accession number phs000424.v8.p2. All processed GTEx data are available via the GTEx portal (http://gtexportal.org/). The download links to the mixQTL full summary statistics for 49 GTEx tissues are listed in Supplementary Data 1.

Code availability

The mixQTL, mixFine, and mixPred software is available as an R package at https://github.com/hakyimlab/mixqtl¹⁵. A reproducible pipeline for the simulated data and some of the GTEx data analyses is available at https://github.com/liangyy/mixqtl-pipeline¹⁶. A reproducible pipeline for the large-scale GTEx data analysis is available at https://github.com/liangyy/mixqtl-gtex¹⁷. A GPU-based implementation is embedded in tensorQTL (https://github.com/broadinstitute/tensorqtl)¹⁸.

References

1. Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091 (2015).
2. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245 (2016).
3. Zhu, Z. et al. Integration of summary data from gwas and eqtl studies predicts complex trait gene targets. Nat. Genet. 48, 481 (2016).
4. Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014).
5. Hormozdiari, F. et al. Colocalization of gwas and eqtl signals detects target genes. Am. J. Hum. Genet. 99, 1245–1260 (2016).
6. Wen, X., Pique-Regi, R. & Luca, F. Integrating molecular qtl data into genome-wide genetic association analysis: Probabilistic assessment of enrichment and colocalization. PLoS Genet. 13, e1006646 (2017).
7. Sun, W. A statistical framework for eqtl mapping using rna-seq data. Biometrics 68, 1–11 (2012).
8. Van De Geijn, B., McVicker, G., Gilad, Y. & Pritchard, J. K. Wasp: allele-specific software for robust molecular quantitative trait locus discovery. Nat. Methods 12, 1061 (2015).
9. Kumasaka, N., Knights, A. J. & Gaffney, D. J. Fine-mapping cellular qtls with rasqual and atac-seq. Nat. Genet. 48, 206 (2016).
10. Zou, J. et al. Leveraging allelic imbalance to refine fine-mapping for eQTL studies. PLoS Genet. 15, e1008481 (2019).
11. Wang, A. T. et al. Allele-specific qtl fine mapping with plasma. Am. J. Hum. Genet. 106, 170–187 (2020).
12. Mohammadi, P., Castel, S. E., Brown, A. A. & Lappalainen, T. Quantifying the regulatory effect size of cis-acting genetic variation using allelic fold change. Genome Res. 27, 1872–1884 (2017).
13. The GTEx Consortium. The gtex consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
14. Võsa, U. et al. Unraveling the polygenic architecture of complex traits using blood eqtl meta-analysis. bioRxiv https://www.biorxiv.org/content/10.1101/447367v1 (2018).
15. Liang, Y., Barbeira, A. N. & Im, H. K. A scalable unified framework of total and allele-specific counts for cis-qtl, fine-mapping, and prediction. mixqtl r package. https://doi.org/10.5281/zenodo.4443162 (2021).
16. Liang, Y. A scalable unified framework of total and allele-specific counts for cis-qtl, fine-mapping, and prediction. data analysis pipeline for mixqtl r package. https://doi.org/10.5281/zenodo.4443205 (2021).
17. Liang, Y. A scalable unified framework of total and allele-specific counts for cis-qtl, fine-mapping, and prediction. data analysis pipeline of for gtex v8 data. https://doi.org/10.5281/zenodo.4443217 (2021).
18. Aguet, F. & Taylor-Weiner, A. Scaling computational genomics to millions of individuals with gpus. tensorqtl https://github.com/broadinstitute/tensorqtl (2020).
19. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. 82, 1273–1300 (2020).
20. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1 (2010).
21. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
22. Moore, J. E. et al. Expanded encyclopaedias of dna elements in the human and mouse genomes. Nature 583, 699–710 (2020).
23. Shabalin, A. A. Matrix eqtl: ultra fast eqtl analysis via large matrix operations. Bioinformatics 28, 1353–1358 (2012).
24. Ongen, H., Buil, A., Brown, A. A., Dermitzakis, E. T. & Delaneau, O. Fast and efficient qtl mapper for thousands of molecular phenotypes. Bioinformatics 32, 1479–1485 (2015).
25. Taylor-Weiner, A. et al. Scaling computational genomics to millions of individuals with gpus. Genome Biol. 20, 1–5 (2019).
26. Evangelou, E. & Ioannidis, J. P. Meta-analysis methods for genome-wide association studies and beyond. Nat. Rev. Genet. 14, 379–389 (2013).
27. Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–1723 (2008).
28. Stegle, O., Parts, L., Durbin, R. & Winn, J. A bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eqtl studies. PLoS Comput. Biol. 6, e1000770 (2010).

Acknowledgements

We thank G. Saideep for helpful discussions on running RASQUAL. H.K.I., Y.L., and A.N.B. were partially funded by R01MH10766 and P30 DK20595 (Diabetes Research and Training Center).

Author information

Authors and Affiliations

Section of Genetic Medicine, The University of Chicago, Chicago, IL, USA
Yanyu Liang, Alvaro N. Barbeira & Hae Kyung Im
The Broad Institute of MIT and Harvard, Cambridge, MA, USA
François Aguet & Kristin Ardlie

Contributions

Y.L. and H.K.I. conceived and designed the work. Y.L. created the software for mixQTL, mixFine, and mixPred. A.N.B. contributed to software development and documentation. F.A. implemented mixQTL in tensorQTL. F.A. preprocessed GTEx RNA-seq data. K.A. supervised the processing of the data. Y.L. performed most of the analysis and wrote the original draft of the manuscript. F.A. and H.K.I. extensively edited the manuscript. H.K.I. supervised the whole project. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yanyu Liang or Hae Kyung Im.

Ethics declarations

Competing interests

F.A. is an inventor on a patent application related to TensorQTL. H.K.I. has received speaker honoraria from GSK and AbbVie. Y.L., A.N.B., and K.A. declare no competing interests.

Additional information

Peer review information Nature Communications thanks Pejman Mohammadi, and the other, anonymous, reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Liang, Y., Aguet, F., Barbeira, A.N. et al. A scalable unified framework of total and allele-specific counts for cis-QTL, fine-mapping, and prediction. Nat Commun 12, 1424 (2021). https://doi.org/10.1038/s41467-021-21592-8

Received: 04 May 2020
Accepted: 29 January 2021
Published: 03 March 2021
DOI: https://doi.org/10.1038/s41467-021-21592-8
