Abstract
Mapping cell type-specific gene expression quantitative trait loci (cteQTLs) is a powerful way to investigate the genetic basis of complex traits. A popular method for cteQTL mapping is to assess the interaction between the genotype of a genetic locus and the abundance of a specific cell type using a linear model. However, this approach requires transforming RNAseq count data, which distorts the relation between gene expression and cell type proportions and results in reduced power and/or inflated type I error. To address this issue, we have developed a statistical method called CSeQTL that allows for cteQTL mapping using bulk RNAseq count data while taking advantage of allele-specific expression. We validated the results of CSeQTL through simulations and real data analysis, comparing CSeQTL results to those obtained from purified bulk RNAseq data or single cell RNAseq data. Using our cteQTL findings, we were able to identify cell types relevant to 21 categories of human traits.
Introduction
Studying the variation of gene expression is essential for understanding cellular and molecular biology. Gene expression can vary significantly across different cell types, and the composition of cell types can vary across tissue samples^{1}. As a result, variation in gene expression observed in bulk tissue samples can be due to both cell type-specific expression and variations in cell type compositions^{2}. Investigating gene expression quantitative trait loci (eQTLs), or genetic variants associated with gene expression, is a powerful approach for studying the genetic basis of complex traits^{3,4}. Several recent studies found that many genetic loci implicated in human diseases are associated with certain cell types^{5,6,7,8}. By studying cell type-specific eQTLs (cteQTLs), we can gain further insights into the genetic basis of complex traits^{9,10,11}.
A popular method to study cteQTLs using bulk tissue gene expression data is to include an interaction between the genotype at a genetic locus and the abundance of a cell type in a linear model^{3,4,11,12,13,14,15,16,17}. However, linear models require the residual variation in gene expression to be constant across samples. Therefore, it is often necessary to use a log transformation or normal quantile transformation of RNAseq count data to stabilize variance. These transformations can result in nonlinear relationships between transformed gene expression and cell type proportions, leading to a misspecified linear model. An alternative and more appropriate modeling approach is to use negative binomial regression to directly model RNAseq count data. In addition to total read count (TReC), RNAseq data can also provide information about allele-specific expression (ASE). By incorporating both TReC and ASE, it is possible to increase the power of eQTL mapping by taking advantage of the allelic imbalance of gene expression caused by cis-acting eQTLs. We have developed a method called TReCASE that uses this approach^{18,19}. Most local eQTLs are cis-eQTLs, and the terms are often used synonymously. In this paper, we use the term “cis-eQTL” to refer specifically to cis-acting eQTLs that lead to allelic imbalance.
We have previously developed a method called pTReCASE for eQTL mapping using RNAseq data from tumor samples, where we treated tumor and nontumor cells as two distinct cell types with known composition^{20}. However, this approach is limited to situations where there are only two cell types and where cell type proportions vary significantly across samples. It treats bulk TReC as the sum of TReC from the two cell types. In more general situations with an arbitrary number of cell types, there are several challenges to cteQTL mapping. For example, a cell type may have nearly constant proportions across samples, which can make it difficult to accurately estimate the cteQTL effect for that cell type. Additionally, a gene’s expression may be zero or very low in some cell types, making it difficult or impossible to estimate eQTL effects in those cell types. In this paper, we have designed a flexible and robust computational framework from scratch to handle these challenges.
Although single cell RNAseq (scRNAseq) data have become more widely available and can be used to study cteQTLs, there are still some limitations. First, scRNAseq is expensive for studies with large sample sizes, and it also requires high quality samples. Additionally, scRNAseq may not provide a representative sampling of all cell types in a tissue sample, and the inherent sparsity of scRNAseq data can make it difficult to accurately assign cell types to individual cells. Our method allows for a new study design: collecting scRNAseq data from a subset of samples, along with bulk RNAseq data from all samples. The scRNAseq data can be used to create a cell typespecific gene expression reference, and then the bulk RNAseq data can be used for cteQTL mapping after estimating cell type proportions using the reference.
Results
A brief introduction of CSeQTL and OLS method
Our method is called cell type-specific eQTL or CSeQTL for short. CSeQTL jointly models total read count (TReC) and allele-specific read count (ASReC) as a function of covariates, cell type composition, and the genotype at a single nucleotide polymorphism (SNP). More specifically, TReC and ASReC are modeled by a negative binomial and a beta-binomial distribution, respectively, with shared parameters for genetic effects^{18}. Unlike the TReCASE and pTReCASE methods, CSeQTL is designed to handle challenging situations where cell type-specific gene expression may be zero or very low, or the proportion of one or more cell types may be close to zero or lack variation. These challenges can make it difficult or impossible to accurately estimate eQTL effects. We address these issues by using several computational solutions, including trimming outliers of TReC to increase the robustness of our estimates and iteratively detecting and removing non-expressed cell types.
We compare CSeQTL to a linear model approach that we refer to as the ordinary least squares (OLS) method. To implement the OLS method, we first apply an inverse normal quantile transformation to read-depth normalized TReC for each gene. Next, we define a reference cell type (usually the one with the highest average abundance) and fit a linear model with transformed gene expression as the dependent variable and the following covariates as independent variables: the proportions of all cell types except the reference cell type, the genotype at a SNP, and the interactions between the SNP genotype and the proportion of each non-reference cell type. Other covariates such as age, sex, and batch can also be included. With this model setup, the cteQTL effect for the reference cell type is the main effect of the SNP genotype, and the cteQTL effect for a non-reference cell type is the sum of the genotype’s main effect and the effect size of the corresponding interaction term. This model is the same as the one used by Aguirre-Gamboa et al.^{16}.
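The OLS setup described above can be sketched in a few lines. This is a minimal illustration rather than the code used in our analysis: the function names are hypothetical, the rank offset of 0.5 in the inverse normal transform is one common convention that we assume here, and the model fitting step itself is omitted.

```python
from statistics import NormalDist


def inverse_normal_transform(values):
    """Rank-based inverse normal transform; ties are broken by input order.

    The (rank - 0.5) / n offset is an assumed convention, not taken from the text.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out


def design_row(genotype, props, ref):
    """One row of the design matrix: intercept, non-reference proportions,
    SNP genotype, and genotype x non-reference proportion interactions."""
    non_ref = [p for q, p in enumerate(props) if q != ref]
    return [1.0] + non_ref + [float(genotype)] + [genotype * p for p in non_ref]


def cell_type_effects(snp_main, interactions, ref, n_cell_types):
    """Reference cell type: SNP main effect.
    Non-reference cell types: main effect plus the matching interaction."""
    effects, k = [], 0
    for q in range(n_cell_types):
        if q == ref:
            effects.append(snp_main)
        else:
            effects.append(snp_main + interactions[k])
            k += 1
    return effects
```

For three cell types with the first as reference, a fitted main effect of 0.5 and interaction coefficients of 0.25 and -0.5 would translate into cell type-specific effects of 0.5, 0.75, and 0 on the transformed scale.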
CSeQTL controls type I error and has much higher power than OLS
We conducted simulations to evaluate type I error and power of CSeQTL in a variety of settings. First, we varied the baseline expression (i.e., gene expression of the reference allele) across cell types. Second, we considered three scenarios of cell type composition variation for three cell types, referred to as CT1, CT2, and CT3 (Fig. 1a). In scenario 1, cell type proportions were generated independently and identically distributed, and then normalized to sum to one. This scenario represents an ideal, but unrealistic, situation. In scenario 2, we created more realistic cell type proportions by setting the average abundance of CT3 to be lower than CT1 and CT2, and by reducing the variance in the proportion of CT3. This scenario represents a more difficult situation for cteQTL mapping of CT3. In scenario 3, we added outlier proportions to the simulated proportions of scenario 2 to mimic observations in real data. We also conducted a secondary set of simulations to explore the performance of CSeQTL given noisy estimates of cell type proportions (Supplementary Fig. 1).
We set the mean expression of the reference allele in CT1 to be 500. For example, if the reference/alternative allele is A/T, then the mean expression in an individual with genotype AA is 1000. We set the ASReC to be 5% of the TReC. We also set the fold change for the reference allele gene expression of CT2 or CT3 vs. CT1 to be 0.1, 1.0, or 10. Following the TReCASE model, TReC and ASReC were simulated conditional on phased SNP genotypes, cell type proportions, expected expressions per allele and cell type, and other covariates. The sample size was 300. All the eQTLs were set to be cis-eQTLs that influenced both TReC and ASReC, and we specified eQTL effect size by the fold change of alternative allele B vs. reference allele A. In a global null situation, all cteQTL effects were set to be 1.0 (Fig. 1b). In a mixed null/alternative situation, we allowed CT1’s eQTL effect to vary from \(\exp(-1)\) to \(\exp(1)\), set the eQTL effect for CT2 to be 1 (i.e., no eQTL effect), and set the eQTL effect for CT3 to be 1.5. This design allowed us to assess power in CT1 and CT3 and type I error in CT2 simultaneously (Fig. 1c).
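To make the simulation settings concrete, the sketch below computes the expected TReC of one sample. It assumes that bulk expression is the proportion-weighted sum of cell type-specific expression and that each alternative allele scales the per-allele mean by the eQTL fold change; covariates, over-dispersion, and the sampling step are left out, and the function name is hypothetical.

```python
def expected_trec(props, ref_allele_mean, eqtl_fold_change, n_alt):
    """Expected total read count for one sample, ignoring covariates.

    props            -- cell type proportions (assumed to sum to one)
    ref_allele_mean  -- mean expression of the reference allele per cell type
    eqtl_fold_change -- alternative vs. reference allele fold change per cell type
    n_alt            -- number of alternative alleles (0, 1, or 2)
    """
    n_ref = 2 - n_alt
    return sum(
        rho * mu * (n_ref + n_alt * eta)
        for rho, mu, eta in zip(props, ref_allele_mean, eqtl_fold_change)
    )


# Reference-allele mean of 500 in CT1; CT2 and CT3 at 0.1- and 10-fold of CT1.
mu = [500.0, 50.0, 5000.0]
# Only CT3 carries an eQTL (fold change 1.5) in this example.
eta = [1.0, 1.0, 1.5]

aa_pure_ct1 = expected_trec([1.0, 0.0, 0.0], mu, eta, n_alt=0)
bb_mixed = expected_trec([0.6, 0.3, 0.1], mu, eta, n_alt=2)
```

With a sample made entirely of CT1 and genotype AA, the function returns the value of 1000 quoted above.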
Under both global null and mixed null/alternative situations, CSeQTL controls type I error but OLS has apparent type I error inflation in several configurations (Fig. 1b, c). Focusing on the mixed null/alternative situation, we found that under scenario 1, when the three cell types have the same distribution of proportions, CSeQTL generally has higher power than OLS. When the baseline expression of CT3 is low (0.1 fold of CT1), OLS’s power in CT3 is positively correlated with CT1’s eQTL effect size even though CT3 has a constant effect size throughout. This “leaking” of eQTL effect from CT1 to CT3 is likely due to the transformation of gene expression. OLS also suffers from inflated type I error (i.e., eQTL findings from CT2) in cases where CT2 has lower baseline expression, highlighting the difficulty in estimating eQTL effects when cell typespecific gene expression levels are low.
In scenario 2, where CT1 has the highest proportion and CT3 has the lowest proportion, power to detect cteQTLs is reduced across models and cell types when compared with scenario 1. CSeQTL’s power to detect CT1 eQTLs is much higher than OLS. CT3 eQTLs are detectable by either method if its baseline expression is high and in that case (2nd row and 2nd column of Fig. 1c) CSeQTL has much higher power than OLS, e.g., >80% power by CSeQTL vs. <20% power by OLS. OLS still has type I error inflation for CT2, to a smaller degree than in scenario 1. Finally in scenario 3, the introduction of outliers in cell type proportions substantially increases the type I error inflation of OLS for CT2 when baseline expression of CT2 is low. Additional simulation results, including results using a noisy version of cell type proportions, are presented in Supplementary Figs. 2–5.
Our implementation allows for the trimming of outliers whose Cook’s distance is larger than a threshold, following the approach used by DESeq2^{21}. Cook’s distance is calculated based on the null model (no eQTL), and the values of outliers are imputed with the null model’s predicted outcome (see “Methods” section “Trimming influential counts” for more details). This trimming procedure may slightly reduce power, but helps to guard against type I error. The impact of trimming is more apparent in one dataset (GTEx brain samples) in our real data analysis, which we discuss in the next section.
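The logic of trimming can be illustrated with a toy example. The sketch below swaps in a simple linear regression on a single covariate for the negative binomial null model, so the threshold scale differs from the one used in our analysis; the function name and the example threshold are assumptions made for illustration only.

```python
def trim_by_cooks_distance(x, y, threshold):
    """Fit y ~ 1 + x by least squares, flag observations whose Cook's
    distance exceeds `threshold`, and replace them by their fitted values."""
    n, p = len(x), 2  # p = number of regression coefficients
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance
    if s2 == 0:  # perfect fit: nothing to trim
        return [], list(y)
    flagged, trimmed = [], list(y)
    for i, xi in enumerate(x):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage of observation i
        d = resid[i] ** 2 / (p * s2) * h / (1.0 - h) ** 2
        if d > threshold:
            flagged.append(i)
            trimmed[i] = fitted[i]  # impute with the null-model prediction
    return flagged, trimmed


x = [float(i) for i in range(10)]
y = [2.0 * xi + 1.0 for xi in x]
y[9] = 200.0  # one grossly inflated observation
flagged, trimmed = trim_by_cooks_distance(x, y, threshold=1.0)
```

In this toy data set only the last observation is flagged, and its value is pulled back to the fitted line while all other observations are left unchanged.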
In summary, power to detect cteQTLs is driven by the model and positively correlated with eQTL effect size, absolute and relative reference allele expression, and variability in cell type proportions.
CSeQTL identifies many more cteQTLs than OLS in human brain and blood
We analyzed bulk RNAseq data from three sources: 670 whole blood samples from the Genotype Tissue Expression (GTEx) project^{3}, 254 schizophrenia patients and 283 controls from the CommonMind Consortium (CMC)^{22,23}, and 175 brain samples from GTEx. Additionally, we studied cell typepurified bulk RNAseq data from the BLUEPRINT cohort, including purified CD4+ T cells (n = 212), monocytes (n = 197), and neutrophils (n = 205). For the purified bulk RNAseq data, CSeQTL was equivalent to TReCASE, and the results were used to validate cteQTL results from GTEx whole blood samples.
We obtained phased genotypes, TReC, ASReC, observed covariates, and latent batch covariates for each of the four cohorts (GTEx whole blood, CMC brain, GTEx brain, and BLUEPRINT). See Supplementary Note 4 for more information. Using ICeDT^{24}, we estimated cell type proportions based on TReC and cell type-specific reference data for six brain cell types^{25} and 22 blood cell types^{26}. See Supplementary Note 2 for more details.
We found that the distributions of cell type proportions were similar between schizophrenia patients and healthy controls in CMC and GTEx brain samples (Fig. 2a). Excitatory neurons (Exc) had the highest proportions, followed by astrocytes (Astro), inhibitory neurons (Inh), oligodendrocytes (Oligo), and oligodendrocyte precursor cells (OPC). Microglia had the lowest proportions and the smallest variation, making it difficult to detect cteQTLs in this cell type. For the 22 blood cell types^{26}, we collapsed them into seven cell types due to limited prevalence and variability in some cell types (Supplementary Fig. 8 and “Methods” section “Grouping 22 blood cell types to seven cell types”). In GTEx whole blood samples, neutrophils were the dominant cell type with the highest proportions and the largest variance (Fig. 2b).
We conducted both cteQTL mapping and traditional bulk eQTL mapping, which assesses aggregated eQTL effects in bulk tissue samples. For both cteQTL and bulk eQTL mapping, we included a set of covariates: library size, observed covariates (such as age, sex, and known batch effects), and genotype principal components (PCs). We also added latent factors estimated from gene expression data, which were calculated by PCs of residualized gene expression data after accounting for all the aforementioned covariates. We obtained two sets of residualized TReC PCs: the first set was generated by residuals that did not account for cell type proportions and the second set did. The first set was used for bulk eQTL mapping, to mimic the common practice of eQTL mapping. The second set was used in cteQTL mapping. When using OLS for cteQTL mapping, we included cell type proportions and interaction terms between genotype and cell type proportions. We excluded genes with low expression in most samples (75th percentile of TReC <50) and SNPs with minor allele frequencies below 5% from our analysis. We considered SNPs located between 50 kilobases before the transcription start site and 50 kilobases after the transcription end site, including those within the gene body.
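The gene and SNP filters described above are simple to express in code. The following is a minimal sketch with hypothetical function names; it assumes genotypes are coded as alternative-allele counts and uses the "inclusive" quantile definition, which may differ slightly from the one used in our pipeline.

```python
from statistics import quantiles


def keep_gene(trec, min_q75=50):
    """Keep a gene unless the 75th percentile of its TReC is below min_q75."""
    return quantiles(trec, n=4, method="inclusive")[2] >= min_q75


def minor_allele_frequency(genotypes):
    """genotypes: alternative-allele counts (0, 1, or 2), one per sample."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def keep_snp(position, genotypes, tss, tes, window=50_000, min_maf=0.05):
    """Keep a SNP within `window` bp of the gene span with MAF >= min_maf.

    Taking min/max of TSS and TES handles genes on either strand.
    """
    start, end = min(tss, tes), max(tss, tes)
    in_window = start - window <= position <= end + window
    return in_window and minor_allele_frequency(genotypes) >= min_maf
```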
We trimmed expression outliers of each gene using Cook’s distance. To determine the appropriate threshold of Cook’s distance for each dataset, we ran TReConly eQTL mapping using permuted data for all the genes on chromosome 1 with thresholds of 10, 15, 20, or no trimming. We selected the threshold for each dataset to ensure that type I error was controlled per cell type. The selected thresholds were 10 for GTEx brain data and 20 for the other three cohorts. A more aggressive trimming threshold was needed for GTEx brain data, likely due to its smaller sample size.
For eQTL mapping, we need to account for two layers of multiple testing: (1) testing across multiple local SNPs per gene and (2) testing across genes. For each gene, we assessed the significance of its minimum p value across all local SNPs by calculating the corresponding permutation p value. A bruteforce implementation, which involves permuting the data many times and running CSeQTL on each permuted dataset, is computationally prohibitive. Instead, we used a computationally efficient method called geoP^{19,27} to calculate a permutation p value by estimating the effective number of independent tests. After this step, each gene has one permutation p value. To account for multiple testing across genes, we selected a permutation p value cutoff to control false discovery rate (FDR) quantified by qvalue (Supplementary Note 3). We calculated a qvalue^{28} for each permutation p value cutoff and chose a qvalue cutoff 0.005 by default. This cutoff was smaller than typical FDR cutoff (e.g., 0.05) because the calculation of qvalue accounted for the proportion of nulls, which could lead to a liberal permutation p value cutoff when the proportion of nulls was small. For bulk eQTL results, a qvalue of 0.005 corresponds to permutation p value around 0.02 while a qvalue of 0.05 may correspond to a permutation p value larger than 0.1 (Supplementary Tables 7–10). We applied this twostep multiple testing correction procedure to both bulk eQTL mapping and cteQTL mapping for each cell type. A similar procedure has been used in GTEx studies^{3}.
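The structure of the two-step correction can be sketched as follows. This is not geoP or the qvalue procedure: as a stand-in, the first step uses a Šidák-style adjustment with an assumed effective number of tests, and the second step uses Benjamini-Hochberg adjusted p values scaled by an assumed proportion of nulls `pi0`, in the spirit of a q-value. All function names are hypothetical.

```python
def gene_level_p(min_p, n_effective_tests):
    """Step 1: adjust a gene's minimum p value for the effective number
    of independent local SNPs (Sidak-style stand-in for a permutation p value)."""
    return 1.0 - (1.0 - min_p) ** n_effective_tests


def bh_adjust(p_values, pi0=1.0):
    """Step 2: Benjamini-Hochberg adjusted p values across genes,
    multiplied by an assumed proportion of null genes pi0."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top  # rank 1 = smallest p value
        running_min = min(running_min, pi0 * p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


gene_p = [gene_level_p(p, k) for p, k in [(0.0005, 20), (0.002, 20), (0.01, 5)]]
conservative = bh_adjust(gene_p)         # pi0 = 1
liberal = bh_adjust(gene_p, pi0=0.5)     # half of the genes assumed null
```

Lowering `pi0` shrinks every adjusted value, which illustrates why a fixed cutoff becomes more liberal on the permutation p value scale when the proportion of nulls is small.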
When performing bulk eQTL mapping or cteQTL mapping using cell typepurified samples from BLUEPRINT, CSeQTL is equivalent to the TReCASE method^{29}. Consistent with our previous results^{19,29}, CSeQTL has much higher power than OLS. For example, considering the results for CMC schizophrenia samples (n = 250), for bulk eQTL mapping after trimming outliers, CSeQTL and OLS identified around 6900 and 2900 eGenes (genes with at least one significant eQTL) respectively (Supplementary Table 2). Similar results were observed for BLUEPRINT data from purified blood cell types (Supplementary Table 4 and Supplementary Figs. 17 and 18).
CSeQTL identified many more cteQTLs than OLS for different brain cell types. After trimming, CSeQTL identified hundreds to thousands of cteQTLs per cell type in CMC schizophrenia data and OLS only identified two eQTLs in oligodendrocytes (Fig. 2c and Supplementary Table 2). The results were similar for CMC control samples (n = 275) and trimming outliers did not have a large impact (Supplementary Table 2). In contrast, trimming outliers had a large effect for GTEx brain data, which had a relatively small sample size of 174. In particular, for microglia, the cell type with the lowest abundance, CSeQTL and OLS identified 885 and 184 eGenes before trimming, but only 96 and 0 eGenes after trimming (Supplementary Fig. 14 and Supplementary Table 3). The results from GTEx brain data suggest that CSeQTL still has much higher power than OLS when sample size is small, but should be used with caution.
For blood cteQTLs estimated by GTEx whole blood data, OLS identified 1014 eGenes in neutrophil, and two or zero eGenes in other cell types. In contrast, CSeQTL identified >4000 eGenes in neutrophil, including most findings by OLS (Supplementary Fig. 21) and hundreds of eGenes in other cell types (Fig. 2d and Supplementary Table 5).
CSeQTL results demonstrated limited eGene overlaps across cell types (Supplementary Figs. 12, 15 and 20), though the majority of cteQTLs overlap with the eQTLs detected by bulk eQTL mapping. These results suggest cteQTL signals may be detectable from bulk tissue samples, though without knowing the relevant cell types. Very low consistency between cteQTLs and bulk eQTLs may indicate false discoveries in cteQTLs. For example, for the GTEx brain study, before trimming, OLS identified 1332 eQTLs in microglia for 184 eGenes, of which fewer than 0.01% overlapped with bulk eQTLs, and none of these 1332 eQTLs remained significant after trimming outliers (Supplementary Table 3). In all comparisons hereafter, we focused on the eQTL results after trimming outliers since earlier results demonstrated it could reduce the number of false positives.
CSeQTL findings have significant overlaps with cteQTLs identified by purified bulk RNAseq data or scRNAseq data
We validated the CSeQTL findings from GTEx whole blood using the eQTLs identified from purified bulk RNAseq data of three cell types—CD4T, monocyte, and neutrophils—from the BLUEPRINT project^{30}. A large number of eGenes were identified from BLUEPRINT data and the number was imbalanced across cell types (Supplementary Table 6). In order to make a meaningful comparison, we compared the CSeQTL findings to the top 500 eGenes (<5% of all genes considered by BLUEPRINT) for each of the three BLUEPRINT cell types. At a qvalue cutoff of 0.005 for any fold changes, around 35%, 17%, and 12% of CSeQTL eGenes from neutrophil, CD4T, and monocytes overlapped with the top 500 BLUEPRINT eGenes, respectively. These proportions increased to 40%, 30%, and 20% for qvalue < 0.001 and fold change ≥1.5 (Fig. 3a, b). The numbers of overlaps were 5.7–8.8 times of the numbers expected by chance (Fig. 3c). Higher overlapping proportion in neutrophil was expected because it was the most abundant cell type and CSeQTL had higher power for the more abundant cell type.
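The comparison with chance can be made explicit with a small calculation. The sketch below assumes that both eGene sets are drawn from a common universe of genes tested in both studies, so that the expected overlap under independence is the product of the set sizes divided by the universe size; the function name and the toy gene identifiers are hypothetical.

```python
def overlap_fold_enrichment(set_a, set_b, n_universe):
    """Return (observed overlap, overlap expected by chance, fold enrichment)
    for two gene sets drawn from a universe of n_universe genes."""
    a, b = set(set_a), set(set_b)
    observed = len(a & b)
    expected = len(a) * len(b) / n_universe
    return observed, expected, observed / expected


cseqtl_egenes = {"g1", "g2", "g3", "g4"}
reference_egenes = {"g3", "g4", "g5", "g6", "g7"}
observed, expected, fold = overlap_fold_enrichment(
    cseqtl_egenes, reference_egenes, n_universe=100
)
```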
We also compared the CSeQTL results from GTEx whole blood with the cteQTLs identified from a large scRNAseq dataset^{31} from peripheral blood mononuclear cells (PBMCs) of 982 donors, with an average of 1291 cells per donor. Yazar et al.^{31} studied cteQTLs in 14 types of immune cells. We removed two cell types with very low proportions and very small number of eGenes. The remaining 12 cell types were collapsed to five categories: B, CD4T, CD8T, Monocyte, and NK (Natural Killer), matching the cell types studied in GTEx whole blood data. This was a challenging comparison because the five cell types had small proportions in whole blood samples, where the most abundant cell type was neutrophil (Fig. 2a, b). Nevertheless, we found highly significant overlaps between CSeQTL eGenes and Yazar et al. eGenes, with fold change enrichments ranging from 4.1 to 6.7 (Fig. 3d–f). The fact that more stringent criteria to select cteQTLs lead to larger overlap proportions (Fig. 3a, d) suggests that our quantification of cteQTL effect sizes and significance levels is useful to select stronger cteQTLs. CD4T is the most abundant cell type studied by Yazar et al.^{31}, though the replication percentage is lower than most other cell types, likely due to two reasons. First, its proportion is low in whole blood samples (Fig. 2b). Second, similarity between CD4T cells and other cell types, such as CD8T cells, may lead to reduced accuracy of cell type deconvolution in bulk RNAseq data as well as cell type classification in scRNAseq data. Another important criterion to evaluate eQTL findings is the consistency of eQTL effect directions. We examined the cteQTLs that were identified by both CSeQTL (using a qvalue cutoff 0.1 or 0.005) and scRNAseq data (p value < 0.01), and found the eQTL directions were consistent for more than 90% of cteQTL findings across most cell types (Fig. 3g). 
In contrast, without applying any qvalue/p value filtering the consistency proportion is 51% (Supplementary Table 11).
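Direction consistency is simply the fraction of shared eQTLs whose effect estimates agree in sign. A minimal sketch, assuming effects are given on the log fold change scale with the same effect allele in both studies (the function name is hypothetical):

```python
def direction_consistency(effects_a, effects_b):
    """Fraction of paired eQTL effects (log fold changes) with the same sign.
    Pairs in which either estimate is exactly zero are skipped."""
    pairs = [(a, b) for a, b in zip(effects_a, effects_b) if a != 0 and b != 0]
    agree = sum(1 for a, b in pairs if (a > 0) == (b > 0))
    return agree / len(pairs)


bulk_effects = [0.5, -0.2, 0.1, -0.4]
single_cell_effects = [0.3, -0.1, -0.2, -0.9]
consistency = direction_consistency(bulk_effects, single_cell_effects)
```

Two sets of unrelated estimates would be expected to agree in sign about half of the time, which is why the unfiltered value of 51% is interpreted as close to no agreement.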
For the cteQTLs identified from brain samples (GTEx brain, CMC schizophrenia patients or controls), we compared with cteQTLs reported by a single nucleus RNAseq (snRNAseq) study^{32}. Bryois et al.^{32} collected snRNAseq data for 6940 to 14,595 genes in 144 to 192 individuals for eight major brain cell types: excitatory neurons, inhibitory neurons, astrocytes, microglia, oligodendrocytes, oligodendrocyte precursor cells (OPCs), endothelial cells, and pericytes. Both pericytes and endothelial cells had very small numbers of cells and cteQTLs, and thus we skipped them in our comparison. The remaining six cell types were exactly the same as the cell types considered in our CSeQTL analysis. Overall the results were consistent with the findings for immune cell types. The CSeQTL eGenes had significant overlap with the top eGenes reported by Bryois et al. (Supplementary Fig. 22), though the overlap was low for two cell types, inhibitory neurons and microglia, likely due to the low proportions of these two cell types. In addition, cell type-specific expression was similar between excitatory neurons and inhibitory neurons (Supplementary Fig. 9), which could further increase the difficulty of mapping cteQTLs for inhibitory neurons. The eQTL effect direction estimates by CSeQTL and snRNAseq were highly consistent for most cell types except for inhibitory neurons, again suggesting that cteQTL mapping was challenging for this cell type (Fig. 3h and Supplementary Tables 12–14).
We further compared the cteQTLs identified only by scRNAseq/snRNAseq data or only by CSeQTL on bulk RNAseq data. The scRNAseqonly cteQTLs tended to have smaller effect sizes and larger p values. Therefore CSeQTL may have missed those scRNAseqonly cteQTLs because of their weaker effects (Supplementary Fig. 23). CSeQTL combines deconvolution of gene expression and eQTL mapping into one step which accounts for the uncertainty of deconvolution. Therefore, the power of CSeQTL is impacted by both the uncertainty of gene expression deconvolution and the magnitude of the cteQTLs. The cteQTLs identified solely by CSeQTL tended to have smaller effect sizes and higher expression levels in bulk samples (Supplementary Fig. 24). This makes sense because genes with higher expression have smaller uncertainty in gene expression deconvolution. Higher gene expression level should also increase the power of eQTL mapping using scRNAseq data, though its effect could be more pronounced for CSeQTL as it also improves the accuracy of cell type deconvolution.
Characterization of cteQTLs
We have analyzed two blood RNAseq datasets (BLUEPRINT and GTEx whole blood) and three brain RNAseq datasets (CMC schizophrenia (SCZ), CMC control, and GTEx brain). It is interesting to study the consistency of eQTL results across datasets. Overall, CSeQTL results showed a higher level of consistency than OLS results (Supplementary Fig. 25 and Supplementary Table 6). For whole blood, a higher consistency was observed for neutrophil, likely due to its higher abundance. For brain datasets, CMCSCZ and CMCControl showed higher levels of consistency than between CMC dataset and GTEx brain, likely due to batch effects between the two studies.
We summarized the locations of the minimum p value SNPs (minPSNPs) relative to the corresponding eGenes (Supplementary Figs. 11, 13, 16 and 19). In the brain datasets, the locations of minPSNPs from bulk eQTL mapping showed enrichment around the transcription start site (TSS) or transcription end site (TES), though such patterns were not as clear in cteQTLs. A potential reason was that the eQTLs around TSS and TES were more likely to be shared across cell types. In GTEx brain results, more cteQTLs tended to be located further away from the corresponding eGenes, which might be due to the limited sample size and hence higher uncertainty in locating the eQTLs. For the three purified cell types from BLUEPRINT (Supplementary Fig. 16), the enrichment of eQTLs around TSS was stronger than around TES. Similar patterns of eQTL locations were observed for the same three cell types in GTEx whole blood samples (Supplementary Fig. 19).
Next we focused on CSeQTL results and evaluated the distribution of cteQTLs with respect to functional annotations of genomic regions (e.g., enhancers, promoters, 3’ UTR, 5’ UTR, etc.) by Torus^{33} (Supplementary Figs. 26 and 27). For brain tissues, the functional enrichment of eQTLs for excitatory neurons, which were the most abundant cell type, was similar to the functional enrichment of bulk eQTLs. The lack of significant functional enrichment in other cell types could be partially due to the smaller number of cteQTLs. Comparing brain samples of schizophrenia patients vs. controls (either CMC controls or GTEx controls), enrichment of eQTLs at 5’ UTR and noncoding (NC) transcripts was observed in both control groups but was absent in schizophrenia patients. Since neutrophil was the dominant cell type in whole blood, as expected, functional enrichment in neutrophil and whole blood was highly consistent. Despite the small proportions of CD4+ T cells and monocytes in whole blood, CSeQTL results recovered functional enrichment similar to that observed in purified cell types from BLUEPRINT data.
CSeQTL helps interpret GWAS findings
eQTLs are often used to study the genetic basis of complex traits by examining their overlap with genetic loci identified from genome-wide association studies (GWAS). Here we systematically evaluated the overlap between cteQTLs and GWAS hits of either all the traits included in the GWAS catalog^{34} or 21 categories of traits (Fig. 4 and Supplementary Fig. 5). We calculated the enrichment of eQTLs among GWAS hits by a log fold change (the proportion of GWAS hits that overlap with eQTLs vs. the proportion of genetic loci being eQTLs). See section C.3.3 of Vasyl et al.^{19} for details on the computation of point estimates and their confidence intervals.
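The point estimate of the enrichment can be written out directly from its definition. The sketch below is illustrative only: the function and argument names are hypothetical, the natural logarithm is an assumption since the base is not stated here, and the confidence interval calculation from the cited reference is not reproduced.

```python
import math


def gwas_enrichment_log_fc(n_gwas_hits, n_gwas_hits_eqtl, n_loci, n_loci_eqtl):
    """Log fold change of eQTL enrichment among GWAS hits.

    n_gwas_hits      -- number of GWAS hits considered
    n_gwas_hits_eqtl -- number of those hits that overlap with eQTLs
    n_loci           -- number of genetic loci considered overall
    n_loci_eqtl      -- number of those loci that are eQTLs
    """
    prop_gwas = n_gwas_hits_eqtl / n_gwas_hits
    prop_all = n_loci_eqtl / n_loci
    return math.log(prop_gwas / prop_all)


# 20% of GWAS hits are eQTLs vs. 5% of all loci: a four-fold enrichment.
log_fc = gwas_enrichment_log_fc(100, 20, 10_000, 500)
```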
GWAS hits of several categories (e.g., education/wealth) were enriched in the bulk eQTLs of all three brain datasets, though the degree of enrichment (measured by log fold change in Fig. 4) was small. When considering cteQTLs by CSeQTL, due to the smaller number of eQTLs, the confidence to estimate enrichment was often low, which led to wider confidence intervals. Despite such limitation, we observed several interesting findings. For example, the GWAS hits of immune traits were enriched in the cteQTLs for microglia in CMC controls and GTEx brain samples, but not in CMC SCZ samples, suggesting potential SCZspecific and cteQTL signals.
Blood is arguably the most accessible tissue and thus molecular biomarkers (e.g., cell type-specific gene expression) in blood can be very valuable to understand the mechanism that connects genetic variants and complex traits. Our cteQTL results provided a useful resource for such studies (Fig. 5). For example, we observed enrichment of respiratory and skin disease GWAS signals among B cell-specific eQTLs, and an association between liver disease GWAS hits and the eQTLs in CD8+ T cells. Earlier studies have reported that CD8+ T cells were associated with liver damage, hepatitis, immunopathology, and liver cancer^{35,36,37,38,39}.
Discussion
CSeQTL’s framework allows mapping cteQTLs using bulk RNAseq data, by jointly modeling the effects of cell type composition and cteQTLs. We have shown by simulations and real data analyses that CSeQTL can have substantially higher power than a linear regression approach, while still maintaining type I error control. This is due to the underlying statistical model of CSeQTL. Deconvolution of gene expression to individual cell types should be performed using untransformed count data^{40}, while eQTL mapping is often done using transformed gene expression data (e.g., normal quantile transformation) to avoid the impact of outliers in count data. Such outliers are often due to a strong positive association between the mean and the variance of count data. Our CSeQTL method satisfies these two restrictions by directly modeling count data using a negative binomial distribution that accounts for the strong mean-variance dependence. In contrast, the linear regression approach uses transformed gene expression data and models cteQTLs by adding interactions between cell type compositions and genotypes. The transformation of gene expression data distorts their associations with cell type compositions and thus can reduce power and inflate type I error. In addition, we include allele-specific expression in our model to boost the power to detect cteQTLs.
Model optimization of CSeQTL is very challenging because the model may not be identifiable, for example, due to a lack of variation of one cell type’s abundance across individuals or very low expression of one gene in one or more cell types. A naive implementation may result in suboptimal solutions due to noninvertible observed information matrices, negative variances, or extreme parameter estimates, which can have a profound impact on hypothesis testing. We have developed a comprehensive set of assessments to ensure a robust and optimal solution is obtained. In addition, both linear regression and CSeQTL can be sensitive to outliers, and we addressed this issue by trimming those outliers based on the null model without eQTL effects. As shown in our real data analyses, such trimming can be particularly helpful for studies with smaller sample sizes.
Our applications to human brain and blood bulk RNAseq data demonstrate that the linear regression method often identifies no or only a few cteGenes (with the exception of neutrophils in whole blood), while CSeQTL can identify hundreds to thousands of cell type-specific eGenes. When examining the overlap between these cteQTLs and GWAS findings, we identified several interesting results, but with high uncertainty in many cases. Future independent studies and comparisons with larger sample sizes may be needed to reach more definite conclusions.
A limitation of applying our method, or any cteQTL mapping method, to bulk RNAseq data is the need for accurate estimates of cell type composition, which in turn requires an accurate cell type-specific gene expression reference. Here we have applied our method to bulk RNAseq data from brain and blood because these two tissues have readily available cell type-specific gene expression references. We expect that in the near future, with the advance of the Human Cell Atlas^{1} and similar projects, such resources will become available for more tissue types.
Our work also enables a new type of study design to jointly model scRNAseq and bulk RNAseq data to study cteQTLs. For example, scRNAseq data can be collected in a small number of individuals, to be used as reference for cell typespecific expression. In addition, scRNAseq data can also be used for eQTL mapping. After clustering and identification of cell types, scRNAseq data can be converted to pseudobulk data of individual cell types and be used for eQTL mapping, e.g., by applying our TReCASE method^{29}. The likelihood function of the TReCASE model can be combined with CSeQTL model in order to combine bulk RNAseq and scRNAseq data for cteQTL mapping. Adding scRNAseq data to bulk RNAseq data can alleviate some challenges when using bulk RNA data, e.g., limited variability in cell type abundance for one cell type. Adding bulk RNAseq data to scRNAseq data can reduce the cost, increase sample size, and avoid distortion of gene expression in the process of isolating single cells.
Methods
Statistical models
Notations and the joint model of TReC and ASReC
Since our model is the same for any gene-SNP pair, we omit gene and SNP indices to simplify notation. We use i and q as indices for sample and cell type, respectively, where i = 1, …, n, q = 1, …, Q, and n and Q denote the sample size and the number of cell types, respectively. Let T_{i} and N_{i} be the total read count (TReC) and the allele-specific read count (ASReC) mapped in the ith sample. Each SNP of interest has two alleles, A and B. Each gene has two haplotypes that are arbitrarily defined as haplotypes 1 and 2. Let N_{i} − N_{i2} and N_{i2} denote the ASReC mapped to the first and second haplotypes of sample i, respectively.
Let Z_{i} denote the phased genotype for the SNP in sample i, which takes values AA, AB, BA, or BB. Let \(X_{i}=(X_{i1},\ldots,X_{ip})^{\rm{T}}\) be a p-vector of baseline covariates (excluding the intercept), where T denotes vector or matrix transpose. Among the baseline covariates in our model, we adjust for log-transformed read depth, defined as the log of the 75th percentile of a sample’s gene-level TReCs, which is a more robust measure of read depth than the sum of all TReC values. Let ρ_{iq} denote the proportion of the qth cell type in the ith sample, such that \(\sum_{q=1}^{Q}\rho_{iq}=1\) and \(\boldsymbol{\rho}_{i}=(\rho_{i1},\ldots,\rho_{iQ})^{\rm{T}}\). The cell type corresponding to q = 1 is referred to as the reference cell type. Our model is based on the following factorization:
\(P(T_{i},N_{i},N_{i2}\mid Z_{i},X_{i},\boldsymbol{\rho}_{i})=P(T_{i}\mid Z_{i},X_{i},\boldsymbol{\rho}_{i})\,P(N_{i}\mid T_{i},Z_{i},X_{i},\boldsymbol{\rho}_{i})\,P(N_{i2}\mid T_{i},N_{i},Z_{i},X_{i},\boldsymbol{\rho}_{i})\)
Each factor is defined as follows:

\(P(T_{i}\mid Z_{i},X_{i},\boldsymbol{\rho}_{i})\): given (Z_{i}, X_{i}, ρ_{i}), T_{i} is assumed to follow a negative binomial distribution with mean \(\mu_{i}=E[T_{i}\mid Z_{i},X_{i},\boldsymbol{\rho}_{i}]\) and dispersion parameter ϕ such that \(V[T_{i}\mid Z_{i},X_{i},\boldsymbol{\rho}_{i}]=\mu_{i}+\phi\mu_{i}^{2}\). This likelihood term corresponds to the TReC model.

\(P(N_{i}\mid T_{i},Z_{i},X_{i},\boldsymbol{\rho}_{i})\): this term describes the total number of allele-specific reads as a function of TReC. It is determined by the number of heterozygous SNPs within the gene and is constant with respect to the eQTL parameters. Thus it is factored out of the likelihood.

\(P(N_{i2}\mid T_{i},N_{i},Z_{i},X_{i},\boldsymbol{\rho}_{i})\): given (N_{i}, Z_{i}, ρ_{i}), the read count N_{i2} is assumed to be independent of (T_{i}, X_{i}) and to follow a beta-binomial distribution with parameter π_{i}, which is the expected proportion of ASReC from the haplotype harboring the B allele for heterozygous samples among N_{i} allele-specific reads, and a dispersion parameter ψ. This likelihood term corresponds to the ASReC model.
The above likelihood framework is the same as that of our TReCASE method, which combines TReC and ASE to map ciseQTLs^{20,29,41}. As in TReCASE, we reduce the negative binomial and beta-binomial distributions to Poisson and binomial distributions, respectively, when the data do not support a nonzero overdispersion parameter. Next we describe how to extend each component of the likelihood function for cell type-specific eQTL mapping.
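The two likelihood terms that are retained can be sketched for a single sample as follows. This is a minimal illustration, not the CSeQTL implementation: the function names are hypothetical, and the beta-binomial shape parameters π/ψ and (1 − π)/ψ are an assumed parameterization chosen so that ψ acts as an overdispersion parameter.

```python
from math import lgamma, log

def nb_logpmf(t, mu, phi):
    """Negative binomial log-pmf with mean mu and variance mu + phi * mu^2."""
    r = 1.0 / phi  # size parameter implied by the dispersion phi
    return (lgamma(t + r) - lgamma(r) - lgamma(t + 1)
            + r * log(r / (r + mu)) + t * log(mu / (r + mu)))

def _lbeta(a, b):
    """Log of the beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bb_logpmf(k, n, pi, psi):
    """Beta-binomial log-pmf for k successes in n trials.

    Assumed parameterization: shapes pi/psi and (1 - pi)/psi.
    """
    a, b = pi / psi, (1.0 - pi) / psi
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + _lbeta(k + a, n - k + b) - _lbeta(a, b)

def joint_loglik(t, n, n2, mu, phi, pi, psi):
    """TReC term plus ASReC term; the ASReC term drops out when n == 0."""
    ll = nb_logpmf(t, mu, phi)
    if n > 0:
        ll += bb_logpmf(n2, n, pi, psi)
    return ll
```

The middle factor of the factorization is omitted because it does not depend on the eQTL parameters.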
Let μ_{i,z,q} be the expected TReC for the ith sample, zth phased genotype, and qth cell type. We assume a multiplicative model \(\mu_{i,z,q}=\mu_{z,q}\exp\{X_{i}^{\rm{T}}\boldsymbol{\beta}\}\) and that the effects of baseline covariates, β, are the same for all cell types. We assume that the gene expression for each genotype is the sum of allelic expressions such that μ_{AA,q} = μ_{A,q} + μ_{A,q} = 2μ_{A,q}, μ_{AB,q} = μ_{A,q} + μ_{B,q} = μ_{BA,q}, and μ_{BB,q} = μ_{B,q} + μ_{B,q} = 2μ_{B,q}, where μ_{A,q} and μ_{B,q} denote the expected TReC for the A and B alleles of the qth cell type, respectively. Define κ_{q} = μ_{A,q}/μ_{A,1} and η_{q} = μ_{B,q}/μ_{A,q}, with \(\boldsymbol{\kappa}=(\kappa_{1},\ldots,\kappa_{Q})^{\rm{T}}\) and \(\boldsymbol{\eta}=(\eta_{1},\ldots,\eta_{Q})^{\rm{T}}\). The nuisance parameter κ_{q} is the fold change of the A allele’s expression in the qth cell type vs. the first cell type, so that κ_{1} = 1. η_{q} is the eQTL effect size: the expression fold change of the B allele vs. the A allele in the qth cell type.
Linear model
To establish a baseline for comparison that mimics published analyses, we fit a linear model by ordinary least squares (OLS). Let G_{i} denote the number of B alleles at a given phased SNP for the ith sample (i.e., G_{i} = 0, 1, 1, and 2 for Z_{i} = AA, AB, BA, and BB, respectively). For a gene-SNP pair, the cell type-specific linear model is:
where \(\bar{T}_{i}\) is the inverse normal quantile transformation of read-depth-adjusted T_{i}. The benefit of the transformation is that it guards against outliers on the count scale, and it is a popular choice in eQTL studies^{3,42}. From the above model, we can test H_{0} : ζ_{g} + δ_{q} = 0 to assess the strength of the cell type-specific eQTL for the qth cell type, where q = 2, …, Q, and test H_{0} : ζ_{g} = 0 for the reference cell type’s eQTL.
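One form of the interaction model that is consistent with the two hypotheses above is sketched here. Only ζ_{g} and δ_{q} are named in the text; the intercept ζ_{0}, the covariate coefficients ζ_{x}, the cell type main effects γ_{q}, and the error term e_{i} are assumed notation introduced purely for illustration.

```latex
\bar{T}_{i} = \zeta_{0} + X_{i}^{\rm T}\boldsymbol{\zeta}_{x}
  + \sum_{q=2}^{Q}\gamma_{q}\,\rho_{iq}
  + \zeta_{g}\,G_{i}
  + \sum_{q=2}^{Q}\delta_{q}\,\rho_{iq}\,G_{i}
  + e_{i}
```

Under this form, a sample composed entirely of the reference cell type has genotype slope ζ_{g}, and a sample composed entirely of cell type q has genotype slope ζ_{g} + δ_{q}, which matches the null hypotheses stated above.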
TReC model
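Let \(\eta_{q}^{({\rm{T}})}\) be the eQTL effect size for the TReC model, where the superscript ^{(T)} indicates the TReC model. Given the above notation and parameters, let μ_{i,AA} be the expected TReC for the ith sample with genotype AA, defined such that:
\(\mu_{i,AA}=\sum_{q=1}^{Q}\rho_{iq}\,\mu_{i,AA,q}=2\mu_{A,1}\exp\{X_{i}^{\rm{T}}\boldsymbol{\beta}\}\sum_{q=1}^{Q}\rho_{iq}\kappa_{q}\), so that \(\log(\mu_{i,AA})=\log(2\mu_{A,1})+X_{i}^{\rm{T}}\boldsymbol{\beta}+\log\left(\sum_{q=1}^{Q}\rho_{iq}\kappa_{q}\right)\).
With similar derivations for genotypes AB, BA, and BB, we have:
\(\mu_{i,AB}=\mu_{i,BA}=\mu_{i,AA}\,(1+\xi_{i}^{({\rm{T}})})/2\) and \(\mu_{i,BB}=\mu_{i,AA}\,\xi_{i}^{({\rm{T}})}\), where \(\xi_{i}^{({\rm{T}})}=\sum_{q=1}^{Q}\rho_{iq}\kappa_{q}\eta_{q}^{({\rm{T}})}\big/\sum_{q=1}^{Q}\rho_{iq}\kappa_{q}\).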
It is crucial to notice that \({\xi }_{i}^{({{{{{\rm{T}}}}}})}\) represents the bulk eQTL effect size for the ith sample. If eQTL effect is the same across cell types (\({\eta }_{1}^{({{{{{\rm{T}}}}}})}=\cdots={\eta }_{Q}^{({{{{{\rm{T}}}}}})}={\eta }^{({{{{{\rm{T}}}}}})}\)), \({\xi }_{i}^{({{{{{\rm{T}}}}}})}={\eta }^{({{{{{\rm{T}}}}}})}\) simplifying CSeQTL’s TReC model to the bulk TReC model presented by Sun^{29}. For Q = 2, CSeQTL’s TReC model would correspond to pTReCASE’s TReC model^{20}. After centering continuous covariates among X_{i} and setting categorical covariates among X_{i} to their reference level, the intercept term of the above model represents the logtransformed expected TReC of the reference cell type with genotype AA. This straightforward interpretation of CSeQTL is a crucial feature for model optimization and parameter estimate interpretation.
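The relation between cell type-level and bulk quantities can be checked numerically from the definitions of κ_{q} and η_{q} and the additive allelic model. The sketch below is illustrative only; the function names are hypothetical, and it simply sums proportion-weighted cell type means.

```python
import math

def bulk_eqtl_effect(rho, kappa, eta):
    """Bulk effect: sum_q rho_q*kappa_q*eta_q / sum_q rho_q*kappa_q."""
    denom = sum(r * k for r, k in zip(rho, kappa))
    return sum(r * k * e for r, k, e in zip(rho, kappa, eta)) / denom

def expected_trec(genotype, rho, kappa, eta, mu_a1, x, beta):
    """Expected bulk TReC as a proportion-weighted sum over cell types."""
    cov = math.exp(sum(xj * bj for xj, bj in zip(x, beta)))
    n_b = {"AA": 0, "AB": 1, "BA": 1, "BB": 2}[genotype]  # number of B alleles
    total = 0.0
    for r, k, e in zip(rho, kappa, eta):
        mu_a = mu_a1 * k   # A-allele mean in this cell type
        mu_b = mu_a * e    # B-allele mean in this cell type
        total += r * ((2 - n_b) * mu_a + n_b * mu_b)
    return total * cov
```

When all η_{q} are equal, `bulk_eqtl_effect` returns that common value regardless of the proportions, which is the reduction to the bulk model noted above.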
ASReC model
Let \(\eta_{q}^{({\rm{A}})}\) be the eQTL effect size associated with the ASReC model. Let \(P_{BB}(N_{1}\mid N;\pi;\psi)\) be the beta-binomial probability of observing N_{1} successes among N trials with success probability π and overdispersion parameter ψ. For a gene-SNP pair, the ASReC likelihood is defined as:
Similar to the TReC model:
which is the bulk eQTL effect size estimated from ASReC. If N_{i} = 0, the ASReC likelihood factors out of the joint model. Furthermore, while samples with genotypes AA and BB do not add information when estimating \({\eta }_{q}^{({{{{{\rm{A}}}}}})}\) and κ_{q}, they contribute toward estimating ψ.
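A form of the ASReC likelihood and bulk effect size that is consistent with the definitions above, and with the TReCASE framework that this model extends, is sketched below. The symbol N_{iB} (the ASReC from the haplotype carrying the B allele, i.e., N_{i2} for Z_{i} = AB and N_{i} − N_{i2} for Z_{i} = BA), the success probability of 1/2 for homozygous samples, and the exact product form are assumptions made for illustration; products run over samples with N_{i} > 0.

```latex
L^{({\rm A})} =
  \prod_{i:\,Z_{i}\in\{AB,BA\}} P_{BB}\left(N_{iB}\mid N_{i};\pi_{i};\psi\right)
  \prod_{i:\,Z_{i}\in\{AA,BB\}} P_{BB}\left(N_{i2}\mid N_{i};1/2;\psi\right),
\qquad
\pi_{i} = \frac{\xi_{i}^{({\rm A})}}{1+\xi_{i}^{({\rm A})}},
\qquad
\xi_{i}^{({\rm A})} =
  \frac{\sum_{q=1}^{Q}\rho_{iq}\kappa_{q}\eta_{q}^{({\rm A})}}
       {\sum_{q=1}^{Q}\rho_{iq}\kappa_{q}}
```

In this form, homozygous samples carry no information about \(\eta_{q}^{({\rm A})}\) or κ_{q} but still inform ψ, in line with the remark above.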
cis/trans eQTL testing and eQTL testing
Following Sun^{29}, the model-specific eQTL parameters \(\eta_{q}^{({\rm{T}})}\) and \(\eta_{q}^{({\rm{A}})}\) are used to formally characterize cis and trans eQTLs. By defining \(\eta_{q}^{({\rm{A}})}=\eta_{q}^{({\rm{T}})}\alpha_{q}\), the qth cell type-specific eQTL is cis if α_{q} = 1 and trans otherwise. For ciseQTLs, we use the joint model that combines the TReC and ASReC/ASE models with the shared cell type-specific parameter \(\eta_{q}=\eta_{q}^{({\rm{A}})}=\eta_{q}^{({\rm{T}})}\). We conduct cis/trans testing per gene-SNP pair and per cell type with H_{0} : α_{q} = 1 vs. H_{A} : α_{q} ≠ 1. Let \(\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{Q})^{\rm{T}}\).
eQTL significance testing is conducted for each gene, SNP, and cell type using either the TReC model for transeQTLs, with \(H_{0}:\eta_{q}^{({\rm{T}})}=1\) vs. \(H_{A}:\eta_{q}^{({\rm{T}})}\ne 1\), or the joint model for ciseQTLs, with H_{0} : η_{q} = 1 vs. H_{A} : η_{q} ≠ 1. Thus our model formulation is flexible enough to allow subsets of cell type-specific eQTLs to be cis or transeQTLs.
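The text does not name the test statistic. A common choice for nested likelihood models of this kind is the likelihood ratio test, and the sketch below assumes that choice with one degree of freedom (a single η_{q} or α_{q} fixed at 1 under the null). The function name and its inputs are hypothetical.

```python
from math import erfc, sqrt

def lrt_pvalue_1df(loglik_null, loglik_alt):
    """P-value of a 1-df likelihood ratio test.

    loglik_null: maximized log-likelihood with the parameter fixed at 1
    loglik_alt:  maximized log-likelihood with the parameter free
    """
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2))
    return erfc(sqrt(stat / 2.0))
```

For example, a log-likelihood gain of about 1.92 corresponds to a p-value of about 0.05.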
Optimization scheme and parameter assessment
Given cell type proportions, our optimization scheme for TReC and ASReC model fitting and hypothesis testing is based on the following procedure for a gene-SNP pair. This scheme helps to avoid local optima, since parameter estimation can be sensitive to initialization and influential counts. Let \(\boldsymbol{\theta}=(\mu_{A,1},\boldsymbol{\beta}^{\rm{T}},\phi,\boldsymbol{\kappa}^{\rm{T}},\boldsymbol{\eta}^{\rm{T}},\psi,\boldsymbol{\alpha}^{\rm{T}})^{\rm{T}}\) denote the pre-established set of unconstrained parameters to optimize over. First, we condition T_{i} on X_{i} to obtain \(\widehat{\boldsymbol{\theta}}_{1}=(\widehat{\mu}_{A,1},\widehat{\boldsymbol{\beta}}^{\rm{T}})^{\rm{T}}\) by fitting a Poisson model with Newton-Raphson. Second, we fit a negative binomial model with initialization \(\widehat{\phi}=1\) to obtain \(\widehat{\boldsymbol{\theta}}_{2}=(\widehat{\mu}_{A,1},\widehat{\boldsymbol{\beta}}^{\rm{T}},\widehat{\phi})^{\rm{T}}\), also with Newton-Raphson. Third, we incorporate ρ_{i}, initialize \(\widehat{\kappa}_{q}=1\), and use Broyden–Fletcher–Goldfarb–Shanno (BFGS) to obtain \(\widehat{\boldsymbol{\theta}}_{3}=(\widehat{\mu}_{A,1},\widehat{\boldsymbol{\beta}}^{\rm{T}},\widehat{\phi},\widehat{\boldsymbol{\kappa}}^{\rm{T}})^{\rm{T}}\). Fourth, we incorporate Z_{i}, initialize \(\widehat{\eta}_{q}=1\), and use BFGS to obtain \(\widehat{\boldsymbol{\theta}}_{4}=(\widehat{\mu}_{A,1},\widehat{\boldsymbol{\beta}}^{\rm{T}},\widehat{\phi},\widehat{\boldsymbol{\kappa}}^{\rm{T}},\widehat{\boldsymbol{\eta}}^{\rm{T}})^{\rm{T}}\).
Fifth, we incorporate (N_{i}, N_{i2}), initialize \(\widehat{\psi}=1\), fix \(\widehat{\alpha}_{q}=1\), and run BFGS to obtain \(\widehat{\boldsymbol{\theta}}_{5}=(\widehat{\mu}_{A,1},\widehat{\boldsymbol{\beta}}^{\rm{T}},\widehat{\phi},\widehat{\boldsymbol{\kappa}}^{\rm{T}},\widehat{\boldsymbol{\eta}}^{\rm{T}},\widehat{\psi})^{\rm{T}}\). Lastly, we optimize over the full parameter set to obtain \(\widehat{\boldsymbol{\theta}}=(\widehat{\mu}_{A,1},\widehat{\boldsymbol{\beta}}^{\rm{T}},\widehat{\phi},\widehat{\boldsymbol{\kappa}}^{\rm{T}},\widehat{\boldsymbol{\eta}}^{\rm{T}},\widehat{\psi},\widehat{\boldsymbol{\alpha}}^{\rm{T}})^{\rm{T}}\), also with BFGS.
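The first stage of this staged scheme, a Poisson fit by Newton-Raphson, can be illustrated with a deliberately small example. The sketch below assumes a log-linear mean with an intercept b0 (playing the role of the log baseline expression) and a single covariate; the function name and the two-parameter restriction are simplifications for illustration, not the package's implementation.

```python
import math

def fit_poisson_nr(counts, x, tol=1e-10, max_iter=50):
    """Newton-Raphson for log(mu_i) = b0 + b1 * x_i; returns (b0, b1)."""
    b0 = math.log(sum(counts) / len(counts))  # start at the intercept-only MLE
    b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for t, xi in zip(counts, x):
            mu = math.exp(b0 + b1 * xi)
            g0 += t - mu           # score for b0
            g1 += (t - mu) * xi    # score for b1
            h00 += mu              # observed information entries
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information (2 x 2) times the score
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if math.hypot(d0, d1) < tol:
            break
    return b0, b1
```

With a binary covariate, the fitted means equal the two group averages, which gives a simple check of the update. Later stages would reuse these estimates as starting values for the larger models.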
This optimization scheme has inherent challenges in achieving stable convergence. One key regularity condition is that the estimated parameters are not on the boundary of the parameter space; in our case, μ_{A,q} > 0 and μ_{B,q} > 0, corresponding to nonzero expression for each allele and cell type. A second requirement is sufficient variability in ρ_{iq} across samples to estimate κ_{q}, η_{q}, and α_{q}. This is comparable to the identifiability condition for a linear regression that each covariate has nonzero variance. It is likely that these two requirements are not satisfied for some genes or cell types, in which case the full model or the set of estimable parameters needs to be adjusted. Let l_{n}(θ), \(\dot{l}_{n}(\boldsymbol{\theta})\), and \(\ddot{l}_{n}(\boldsymbol{\theta})\) denote the log-likelihood, score, and (negative) observed information, respectively. Let \(\|\cdot\|_{2}\) denote the L_{2} norm. Convergence is declared when \(\|\dot{l}_{n}(\widehat{\boldsymbol{\theta}})\|_{2}<\epsilon_{1}\), \(\ddot{l}_{n}(\widehat{\boldsymbol{\theta}})\) is invertible, there are no negative variance estimates, and \(\|\ddot{l}_{n}(\widehat{\boldsymbol{\theta}})^{-1}\dot{l}_{n}(\widehat{\boldsymbol{\theta}})\|_{2}<\epsilon_{2}\) for predefined thresholds ϵ_{1} and ϵ_{2}. By default, ϵ_{1} = 10^{−3} and ϵ_{2} = 10^{−6}. To determine which cell type-specific parameters to constrain to their null values (κ_{q} = 0, η_{q} = 1, α_{q} = 1), we run the above optimization procedure, set the unidentifiable parameters to their null values, rerun the optimization, and iterate until all remaining parameters are estimable. More specifically, we initialize our parameters with \(\widehat{\boldsymbol{\theta}}_{2}\) and perform the following operations:

First, we estimate \({\widehat{{{\mbox{}}}\theta {{\mbox{}}}}}_{2}={({\widehat{\mu }}_{A,1},\,{\widehat{{{{{{\boldsymbol{\beta }}}}}}}}^{{{{{{\rm{T}}}}}}},\widehat{\phi })}^{{{{{{\rm{T}}}}}}}\) by maximum likelihood estimate (MLE), while ignoring eQTL and cell type composition.

Next we estimate \(\widehat{\boldsymbol{\theta}}_{3}=(\widehat{\mu}_{A,1},\widehat{\boldsymbol{\beta}}^{\rm{T}},\widehat{\phi},\widehat{\boldsymbol{\kappa}}^{\rm{T}})^{\rm{T}}\). The κ_{q} parameters are estimated relative to the reference cell type (q = 1), and therefore we must ensure that the reference cell type has nonzero TReC (\(\widehat{\mu}_{A,1}>0\)). By default we set the reference cell type to be the one with the highest average proportion across samples. After estimating \(\widehat{\boldsymbol{\kappa}}\), we can determine which cell type has the highest TReC and swap that cell type in as the reference cell type. This choice of reference cell type can vary from gene to gene. It is an internal choice made for computational purposes; in the final output, all parameters are transformed to use the cell type with the highest average proportion as the reference. If θ_{3} cannot be reliably estimated, it indicates that some κ_{q}’s are close to 0. We calculate \(\widehat{\mu}_{AA,q}\equiv 2\widehat{\mu}_{A,1}\widehat{\kappa}_{q}\). If \(\min_{q}(\widehat{\mu}_{AA,q})<2\) (i.e., some cell type is expected to express fewer than one TReC per haplotype), we set \(\widehat{\kappa}_{q}=0\) and \(\widehat{\eta}_{q}=\widehat{\alpha}_{q}=1\) for that cell type and then reoptimize.

Next we estimate eQTL effects η in θ_{4} and θ_{5}. If convergence is achieved, move on to the next step. Otherwise, the ASReCs of one or more cell types are near zero. Then we calculate \({\widehat{\mu }}_{Aq}\equiv {\widehat{\mu }}_{A,1}{\widehat{\kappa }}_{q}\), \({\widehat{\mu }}_{Bq}\equiv {\widehat{\mu }}_{Aq}{\widehat{\eta }}_{q}\), and \({\widehat{\mu }}_{zq}\equiv \min ({\widehat{\mu }}_{Aq},{\widehat{\mu }}_{Bq})\), and variance estimate for \(\log ({\widehat{\eta }}_{q})\) (optimizing over unconstrained parameters). If \(0 \, < \,{\widehat{\mu }}_{zq} \, < \,1\) or η_{q} variance estimate is negative, set \({\widehat{\eta }}_{q}=1\) and reoptimize and repeat until convergence. For each cell type where \({\widehat{\eta }}_{q}=1\), set \({\widehat{\alpha }}_{q}=1\) for subsequent steps.

Next we estimate α in θ_{6}. If convergence is achieved, we have established the full model. Otherwise, check for variance estimates for \(\log ({\widehat{\alpha }}_{q})\) that are negative and set \({\widehat{\alpha }}_{q}=1\) and reoptimize. If none of the cell types α_{q} variances are negative and convergence is not achieved, identify the cell type with largest α_{q} variance and set \({\widehat{\alpha }}_{q}=1\).
In general, whenever we need to reoptimize, we simply restart from the step prior to the current step. There is no need to return to \(\widehat{\boldsymbol{\theta}}_{2}\), since the procedure has already established the “submodel”, or nested set of estimated parameters, that achieved convergence. In addition, our procedure is designed to assess convergence first at each step before attempting to constrain parameter estimates or examining the variance estimates, which avoids unnecessary matrix inversions until all other criteria are met.
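The convergence criteria and the low-expression rule described above can be sketched as two small checks. The function names and argument layout are hypothetical. The sketch assumes that the caller has already attempted to invert the observed information and passes `None` when inversion fails, and it flags every cell type below the cutoff, which is one reading of the rule.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def has_converged(score, newton_step, variances, eps1=1e-3, eps2=1e-6):
    """Check the stated convergence criteria at a fitted parameter vector.

    score       : gradient of the log-likelihood at the estimate
    newton_step : inverse observed information times the score, or None
                  if the information matrix could not be inverted
    variances   : diagonal of the inverse information, or None
    """
    if l2_norm(score) >= eps1:
        return False
    if newton_step is None or variances is None:
        return False  # information matrix not invertible
    if any(v < 0 for v in variances):
        return False  # negative variance estimate
    return l2_norm(newton_step) < eps2

def cell_types_to_constrain(mu_a1, kappa, min_mu_aa=2.0):
    """Indices q whose expected AA expression 2 * mu_A1 * kappa_q is below the cutoff."""
    return [q for q, k in enumerate(kappa) if 2.0 * mu_a1 * k < min_mu_aa]
```

Checking the score norm first mirrors the stated design of deferring matrix inversion until the cheaper criteria are met.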
Trimming influential counts
Data trimming and quality control are crucial issues in regression analyses^{43}. Modeling observed outcomes directly carries the risk that highly influential data points or potential outliers contribute to biased parameter estimates and inflated type I error. For eQTL analysis, if the same subset of samples were consistently identified as outliers, we could exclude them. But among samples that pass quality control, an analysis could end up excluding different subsets of samples for each gene, risking a loss of power and results that are difficult to interpret. In the case of differential expression, DESeq2^{21} systematically trims outlier observations whose Cook’s distance is beyond a predefined cutoff based on the F-distribution. We have adopted a similar trimming approach in our eQTL analyses.
For a given gene, we characterize the influence of a sample through our definition of Cook’s distance for the ith sample with:
where m = p + Q − 1, \(\widehat{\mu}_{j}\) is the estimated mean TReC for the jth sample, \(\widehat{\mu}_{j(i)}\) is the estimated mean TReC for the jth sample after excluding the ith sample, and \(\widehat{v}_{j}=\widehat{\mu}_{j}+\widehat{\phi}\widehat{\mu}_{j}^{2}\) is the estimated TReC variance for the jth sample. Since our TReC model is not a traditional GLM, owing to the sample-specific offset term \(\log(\sum_{q=1}^{Q}\rho_{iq}\kappa_{q})\), we cannot directly characterize leverage. We then calculate a normalized Cook’s distance to put Cook’s distances on the same scale across genes, denoted \(\tilde{C}_{i}\) and characterized as:
where med(…) and mad(…) denote the median and median absolute deviation, respectively. We propose trimming the original TReC (T_{i}) if \(\tilde{C}_{i}>c\), where c is a predefined threshold. To calculate Cook’s distance, we fit CSeQTL’s TReC model without adjusting for the SNP, since a gene can have multiple SNPs and a gene’s TReC can be influential regardless of genotype. We explored using C_{i} > 4/n and C_{i} > F(q = 0.99, m, n − m) as trimming criteria; however, they failed to detect clear visual outliers. We chose an appropriate threshold on \(\tilde{C}_{i}\) by running CSeQTL on chr1 genes with permuted SNP genotypes. We tried cutoff thresholds of 40, 20, 10, and 5, and selected the largest threshold that controls the type I error. Unlike DESeq2, which uses trimmed means to impute the TReC value, we impute the TReC value with the estimated TReC for the sample from CSeQTL’s TReC model without SNP adjustment.
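The trimming rule can be sketched as follows. Both formulas in the code are assumptions chosen to match the quantities defined above: Cook’s distance is taken as the variance-scaled squared change in fitted means divided by m, and the normalized distance as the deviation from the median in units of the unscaled median absolute deviation. The function names are hypothetical.

```python
from statistics import median

def cooks_distance(mu_hat, mu_hat_without_i, v_hat, m):
    """Assumed form: C_i = (1/m) * sum_j (mu_j - mu_j(i))^2 / v_j."""
    return sum((a - b) ** 2 / v
               for a, b, v in zip(mu_hat, mu_hat_without_i, v_hat)) / m

def normalized_cooks(c):
    """Assumed form: (C_i - med(C)) / mad(C), computed within one gene."""
    med = median(c)
    mad = median([abs(x - med) for x in c])
    return [(x - med) / mad for x in c]

def flag_influential(c, cutoff):
    """Indices of samples whose normalized Cook's distance exceeds the cutoff."""
    return [i for i, z in enumerate(normalized_cooks(c)) if z > cutoff]
```

For example, `flag_influential([0.1, 0.2, 0.15, 0.12, 0.18, 5.0], 10)` flags only the last sample. A flagged TReC would then be replaced by its fitted value from the model without SNP adjustment.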
Simulation setup
We first describe how the cell type proportions are simulated. Let X ~ U(a, b) denote a random variable X sampled from a continuous uniform distribution ranging from a to b. In the first scenario, \(\rho_{iq}=\exp\{U_{iq}\}/\sum_{s=1}^{Q}\exp\{U_{is}\}\) with U_{iq} ~ U(−4, 4). In the second scenario, to allow cell types to reflect observed proportions with wide and narrow ranges, we simulated ρ_{i1} from a beta distribution with shape parameters 10 and 24 (values based on maximum likelihood estimates from fitting a beta distribution to CMC’s astrocyte cell type proportions), \(\rho_{i2}=\left|0.85-0.76\rho_{i1}-0.03\rho_{i1}^{2}+\epsilon_{i}\right|\), where ϵ_{i} was sampled from a centered normal distribution with standard deviation 0.02, and ρ_{i3} = 1 − ρ_{i1} − ρ_{i2}. If ρ_{i3} < 0, we set it to zero and renormalize the proportions across cell types. For the third scenario, proportions are first simulated under the second scenario. Next, for each cell type, the initial proportions greater than the 99% quantile were replaced by values sampled from U(0.7, 0.9), while initial proportions less than the 1% quantile were replaced by values sampled from U(0, 0.1). These final values are renormalized across cell types to sum to one.
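The first two scenarios can be sketched with the standard library alone. The function names are hypothetical, and the second function assumes that the three terms inside the absolute value are combined with minus signs, as read from the description above.

```python
import math
import random

def simulate_scenario1(q, rng):
    """Softmax of independent U(-4, 4) draws for q cell types."""
    weights = [math.exp(rng.uniform(-4.0, 4.0)) for _ in range(q)]
    total = sum(weights)
    return [w / total for w in weights]

def simulate_scenario2(rng):
    """Three cell types: beta-distributed first type, dependent second type."""
    rho1 = rng.betavariate(10.0, 24.0)
    eps = rng.gauss(0.0, 0.02)
    rho2 = abs(0.85 - 0.76 * rho1 - 0.03 * rho1 ** 2 + eps)
    rho3 = max(0.0, 1.0 - rho1 - rho2)  # truncate at zero if negative
    total = rho1 + rho2 + rho3          # renormalize after truncation
    return [rho1 / total, rho2 / total, rho3 / total]
```

A call such as `simulate_scenario2(random.Random(1))` returns one sample's proportions; the third scenario would post-process a batch of such samples by replacing the extreme quantiles and renormalizing.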
For n = 300, we simulate p = 4 baseline covariates. The first covariate, X_{i1}, represents read depth and is simulated from a gamma distribution with shape parameter 600 and rate parameter 100, based on maximum likelihood estimates from CMC samples. X_{i2}, which represents sex, is generated from a Bernoulli distribution with success probability 0.5. X_{i3} is generated from a continuous uniform distribution ranging from −1 to 1. X_{i4} is simulated from a standard normal distribution. These latter two variables represent arbitrarily distributed continuous covariates. The continuous covariates X_{i1}, X_{i3}, and X_{i4} are centered and scaled to have zero mean and unit variance. Assuming Hardy-Weinberg equilibrium, genotypes were generated from a categorical distribution with probabilities \((1-m_{A})^{2}\), m_{A}(1 − m_{A}), m_{A}(1 − m_{A}), and \(m_{A}^{2}\) for AA, AB, BA, and BB, respectively, where m_{A} denotes the minor allele frequency. We set m_{A} = 0.2.
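A minimal sketch of this covariate and genotype simulation is given below. The function names are hypothetical, and scaling by the population standard deviation is an assumption, since the text does not say which variance estimator was used.

```python
import random
from statistics import mean, pstdev

def standardize(values):
    """Center and scale to zero mean and unit (population) variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def simulate_covariates(n, rng):
    # random.gammavariate takes shape and scale, so scale = 1 / rate
    depth = [rng.gammavariate(600.0, 1.0 / 100.0) for _ in range(n)]
    sex = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    x3 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    x4 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return standardize(depth), sex, standardize(x3), standardize(x4)

def simulate_genotypes(n, maf, rng):
    """Phased genotypes under Hardy-Weinberg equilibrium; B is the minor allele."""
    probs = [(1 - maf) ** 2, maf * (1 - maf), maf * (1 - maf), maf ** 2]
    return rng.choices(["AA", "AB", "BA", "BB"], weights=probs, k=n)
```

With `maf = 0.2`, about 64% of simulated samples are AA, 16% each are AB and BA, and 4% are BB.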
Grouping 22 blood cell types to seven cell types
The “CD4T” cell type is defined by pooling CD4 naive, CD4 memory resting, CD4 memory activated, follicular helper, regulatory T cells (Tregs), and gamma delta T cells. Gamma delta T cells are a distinct type of T cell, whereas all the other T cell types are alpha beta T cells. However, their proportion is very low (Supplementary Fig. 8), and thus adding them to any other cell type does not lead to any noticeable change in cell type composition. Here we combine them into the CD4T category for convenience of implementation. The “B_Cell” cell type is the result of combining naive B cells, memory B cells, and plasma cells. CD8 T cells were not collapsed with other cell types and are simply denoted “CD8T”. The “Mast_Eosinophil” cell type is composed of resting mast cells, activated mast cells, resting dendritic cells, activated dendritic cells, and eosinophils. The “NK” cell type comprises resting and activated natural killer cells. The “Monocytes” cell type is made up of monocytes and M0, M1, and M2 macrophages. The proportions of macrophages are very low (Supplementary Fig. 8), and thus adding them to monocytes does not substantially change monocyte proportions. We further discuss the algebraic interpretation and implications of combining cell types in Supplementary Note 2.5.
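The grouping described above amounts to summing fine-grained proportions within each pooled category. The sketch below encodes the six groups listed in this section; the fine cell type label strings are hypothetical and would need to match the labels of the reference actually used. Any label that is not listed is carried through under its own name.

```python
GROUPS = {
    "CD4T": ["CD4 naive", "CD4 memory resting", "CD4 memory activated",
             "T follicular helper", "Tregs", "T gamma delta"],
    "B_Cell": ["B naive", "B memory", "Plasma"],
    "CD8T": ["CD8 T"],
    "Mast_Eosinophil": ["Mast resting", "Mast activated",
                        "Dendritic resting", "Dendritic activated",
                        "Eosinophils"],
    "NK": ["NK resting", "NK activated"],
    "Monocytes": ["Monocytes", "Macrophages M0", "Macrophages M1",
                  "Macrophages M2"],
}

def collapse(fine):
    """Sum fine cell type proportions (dict: label -> proportion) by group."""
    member_to_group = {m: g for g, members in GROUPS.items() for m in members}
    coarse = {g: 0.0 for g in GROUPS}
    for name, prop in fine.items():
        key = member_to_group.get(name, name)  # unlisted labels keep their name
        coarse[key] = coarse.get(key, 0.0) + prop
    return coarse
```

Because the collapsed proportions are sums of the originals, they still add up to one for each sample.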
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Our work did not generate any new data; we used publicly available datasets. BLUEPRINT data were obtained from the European Genome Archive, including phased SNPs derived from whole genome sequencing (EGAD00001002663) and RNAseq bam files for three purified cell types (EGAD00001002671, EGAD00001002674, EGAD00001002675). CommonMind data were downloaded from https://www.nimhgenetics.org/resources/commonmind. GTEx data were downloaded from NHGRI AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Labspace). Brain MTG data were downloaded from the Allen Brain Institute website https://portal.brainmap.org/atlasesanddata/rnaseq/humanmtgsmartseq. SEAAD snRNAseq data were downloaded from cellxgene: https://cellxgene.cziscience.com/collections/1ca90a2d2943483db678b809bf464c30.
Code availability
The source code for the R package CSeQTL and the analysis pipeline is publicly available at the GitHub repositories https://github.com/pllittle/CSeQTL (https://doi.org/10.5281/zenodo.7901725) and https://github.com/pllittle/CSeQTLworkflow (https://doi.org/10.5281/zenodo.7901800), respectively.
References
Regev, A. et al. Science forum: the human cell atlas. elife 6, e27041 (2017).
Wang, D. et al. Comprehensive functional genomic resource and integrative model for the human brain. Science 362, eaat8464 (2018).
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
KimHellmuth, S. et al. Cell typespecific genetic regulation of gene expression across human tissues. Science 369, eaaz8528 (2020).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genomewide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Skene, N. G. et al. Genetic identification of brain cell types underlying schizophrenia. Nat. Genet. 50, 825 (2018).
Zhu, H., Shang, L. & Zhou, X. A review of statistical methods for identifying traitrelevant tissues and cell types. Front. Genet. 11, 587887 (2021).
Wang, R., Lin, D. Y. & Jiang, Y. Epic: Inferring relevant cell types for complex traits by integrating genomewide association studies and singlecell RNA sequencing. PLoS Genet. 18, e1010251 (2022).
Burgess, D. J. Getting dynamic with eQTLs. Nat. Rev. Genet. 20, 500–501 (2019).
Strober, B. J. et al. Dynamic genetic regulation of gene expression during cellular differentiation. Science 364, 1287–1290 (2019).
Glastonbury, C. A., Alves, A. C., Moustafa, J. S. E. S. & Small, K. S. Celltype heterogeneity in adipose tissue is associated with complex traits and reveals diseaserelevant cellspecific eQTLs. Am. J. Hum. Genet. 104, 1013–1024 (2019).
Westra, H. J. et al. Cell specific eQTL analysis without sorting cells. PLoS Genet. 11, e1005223 (2015).
Aran, D., Hu, Z. & Butte, A. J. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 18, 220 (2017).
Ng, B. et al. An xQTL map integrates the genetic architecture of the human brain’s transcriptome and epigenome. Nat. Neurosci. 20, 1418–1426 (2017).
Donovan, M. K., D’AntonioChronowska, A., D’Antonio, M. & Frazer, K. A. Cellular deconvolution of GTEx tissues powers discovery of disease and celltype associated regulatory variants. Nat. Commun. 11, 1–14 (2020).
AguirreGamboa, R. et al. Deconvolution of bulk blood eQTL effects into immune cell subpopulations. BMC Bioinform. 21, 1–23 (2020).
Patel, D. et al. Celltypespecific expression quantitative trait loci associated with Alzheimer disease in blood and brain tissue. Transl. Psychiatry 11, 250 (2021).
Sun, W. & Hu, Y. eqtl mapping using RNAseq data. Stat. Biosci. 5, 198–219 (2013).
Zhabotynsky, V. et al. eQTL mapping using allelespecific count data is computationally feasible, powerful, and provides individualspecific estimates of genetic effects. PLoS Genet. 18, e1010076 (2022).
Wilson, D. R., Ibrahim, J. G. & Sun, W. Mapping tumorspecific expression QTLs in impure tumor samples. J. Am. Stat. Assoc. 115, 1–18 (2019).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 15, 550 (2014).
Fromer, M. et al. Gene expression elucidates functional impact of polygenic risk for schizophrenia. Nat. Neurosci. 19, 1442–1453 (2016).
Hoffman, G. E. et al. CommonMind Consortium provides transcriptomic and epigenomic data for schizophrenia and bipolar disorder. Sci. Data 6, 1–14 (2019).
Wilson, D. R., Jin, C., Ibrahim, J. G. & Sun, W. ICeDT provides accurate estimates of immune cell abundance in tumor samples by allowing for aberrant gene expression patterns. J. Am. Stat. Assoc. 115, 1055–1065 (2020).
Hodge, R. D. et al. Conserved cell types with divergent features in human versus mouse cortex. Nature 573, 61–68 (2019).
Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).
Sun, W. & Wright, F. A. A geometric interpretation of the permutation pvalue and its application in eQTL studies. Ann. Appl. Stat. 4, 1014–1033 (2010).
Storey, J. D. The positive false discovery rate: a Bayesian interpretation and the qvalue. Ann. Stat. 31, 2013–2035 (2003).
Sun, W. A statistical framework for eQTL mapping using RNAseq data. Biometrics 68, 1–11 (2012).
Chen, L. et al. Genetic drivers of epigenetic and transcriptional variation in human immune cells. Cell 167, 1398–1414 (2016).
Yazar, S. et al. Singlecell eqtl mapping identifies cell type–specific genetic control of autoimmune disease. Science 376, eabf3041 (2022).
Bryois, J. et al. Celltypespecific ciseQTLs in eight human brain cell types identify novel risk genes for psychiatric and neurological disorders. Nat. Neurosci. 25, 1104–1112 (2022).
Wen, X. Molecular QTL discovery incorporating genomic annotations using Bayesian false discovery rate control. Ann. Appl. Stat. 10, 1619–1638 (2016).
Buniello, A. et al. The NHGRIEBI GWAS catalog of published genomewide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Dudek, M. et al. Autoaggressive CXCR6+ CD8 T cells cause liver immune pathology in nash. Nature 592, 444–449 (2021).
Bénéchet, A. P. et al. Dynamics and genomic landscape of CD8+ T cells undergoing hepatic priming. Nature 574, 200–205 (2019).
Wong, Y. C., Tay, S. S., McCaughan, G. W., Bowen, D. G. & Bertolino, P. Immune outcomes in the liver: is CD8 T cell fate determined by the environment? J. Hepatol. 63, 1005–1014 (2015).
John, B. & Crispe, I. N. Passive and active mechanisms trap activated CD8+ T cells in the liver. J. Immunol. 172, 5222–5229 (2004).
Breuer, D. A. et al. CD8+ T cells regulate liver injury in obesityrelated nonalcoholic fatty liver disease. Am. J. Physiol. Gastrointest. Liver Physiol. 318, G211–G224 (2020).
Zhong, Y. & Liu, Z. Gene expression deconvolution in linear space. Nat. Methods 9, 8–9 (2012).
Hu, Y. J., Sun, W., Tzeng, J. Y. & Perou, C. M. Proper use of allele-specific expression improves statistical power for cis-eQTL mapping with RNA-seq data. J. Am. Stat. Assoc. 110, 962–974 (2015).
Wright, F. A. et al. Heritability and genomics of gene expression in peripheral blood. Nat. Genet. 46, 430–437 (2014).
Allen, M. The SAGE Encyclopedia of Communication Research Methods (Sage Publications, 2017).
Acknowledgements
This work was supported by NIH grants R01 GM105785 (P.L., S.L., V.Z., and W.S.), R56 AG079291 (Y.L.), and R01 HG009974 (D.Y.L.).
Author information
Contributions
W.S., D.Y.L., and Y.L. supervised the project. P.L. and W.S. designed the method and acquired and preprocessed the four datasets used in this paper. P.L. wrote the software package used to perform the simulations and real data analyses. V.Z. provided the geoP software. S.L. and W.S. provided the validation analyses. P.L., S.L., and W.S. wrote the manuscript, with input from D.Y.L., Y.L., and V.Z.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Cathal Seoighe, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Little, P., Liu, S., Zhabotynsky, V. et al. A computational method for cell type-specific expression quantitative trait loci mapping using bulk RNA-seq data. Nat Commun 14, 3030 (2023). https://doi.org/10.1038/s41467-023-38795-w
DOI: https://doi.org/10.1038/s41467-023-38795-w