Cell-type-specific and disease-associated expression quantitative trait loci in the human lung

Common genetic variants confer substantial risk for chronic lung diseases, including pulmonary fibrosis. Defining the genetic control of gene expression in a cell-type-specific and context-dependent manner is critical for understanding the mechanisms through which genetic variation influences complex traits and disease pathobiology. To this end, we performed single-cell RNA sequencing of lung tissue from 66 individuals with pulmonary fibrosis and 48 unaffected donors. Using a pseudobulk approach, we mapped expression quantitative trait loci (eQTLs) across 38 cell types, observing both shared and cell-type-specific regulatory effects. Furthermore, we identified disease interaction eQTLs and demonstrated that this class of associations is more likely to be cell-type-specific and linked to cellular dysregulation in pulmonary fibrosis. Finally, we connected lung disease risk variants to their regulatory targets in disease-relevant cell types. These results indicate that cellular context determines the impact of genetic variation on gene expression and implicate context-specific eQTLs as key regulators of lung homeostasis and disease.

Selecting optimal sample and cell number thresholds for pseudobulk eQTL mapping is challenging in the absence of ground truth. Here, drawing on the best practices presented in Cuomo et al. (ref. 1), we selected inclusion criteria to maximize our ability to map eQTL with confidence across many cell types. Specifically, for a given cell type we included only samples with at least 5 cells. This threshold was chosen to be loose enough to minimize donor loss while still eliminating donors with poor expression support, as demonstrated by the simulations and benchmarking of Cuomo et al.
This threshold aims to balance power against noise in pseudobulk eQTL analysis. A higher cell number threshold per pseudobulk profile produces less noisy profiles, because each profile averages over more single cells. However, it causes some cell types to be lost from the analysis (e.g., if too few donors remain for inclusion) and reduces power in the remaining cell types, because fewer donors means a smaller sample size for eQTL mapping. A more lenient threshold retains more cell types, giving a fuller picture of the genetic regulation of gene expression throughout the tissue, and maximizes eQTL detection power. However, it can produce noisier pseudobulk expression profiles, possibly resulting in false positives or false negatives.
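The filtering step described above can be illustrated with a minimal sketch. It assumes that a pseudobulk profile is the per-gene mean over a donor's cells of one cell type; the aggregation and normalization actually used in this study are not specified here, and the function name, input layout, and toy values are hypothetical.

```python
from collections import defaultdict

MIN_CELLS = 5  # minimum cells per donor for a given cell type (threshold from the text)


def pseudobulk(cells, min_cells=MIN_CELLS):
    """Aggregate single cells into donor-by-cell-type pseudobulk profiles.

    cells: iterable of (donor, cell_type, {gene: count}) tuples (hypothetical layout).
    Returns {(donor, cell_type): {gene: mean count}} for groups with >= min_cells cells.
    Mean aggregation is an assumption made for illustration only.
    """
    groups = defaultdict(list)
    for donor, cell_type, counts in cells:
        groups[(donor, cell_type)].append(counts)

    profiles = {}
    for key, members in groups.items():
        if len(members) < min_cells:
            continue  # too few cells: this donor is dropped for this cell type
        genes = set().union(*members)  # all genes seen in any cell of the group
        profiles[key] = {
            g: sum(m.get(g, 0) for m in members) / len(members)
            for g in sorted(genes)
        }
    return profiles


# Toy input: donor D1 has 5 AT2 cells (kept), donor D2 has only 3 (dropped).
toy_cells = [("D1", "AT2", {"SFTPC": c}) for c in (10, 12, 8, 11, 9)]
toy_cells += [("D2", "AT2", {"SFTPC": 7}) for _ in range(3)]
toy_profiles = pseudobulk(toy_cells)
```

Raising `min_cells` in this sketch would make each retained profile an average over more cells, at the cost of dropping more donors, which is the trade-off discussed above.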
A threshold of 5 cells per sample for a given cell type was previously used by Cuomo et al. in their simulation studies and in their detailed benchmarking of single-cell eQTL mapping results against those derived from matched bulk RNA-seq data. When exploring how power, empirical FDR, and beta correlation vary with the number of donors and the average number of cells per donor under simulated distributions of cells per sample, Cuomo et al. demonstrate that the empirical FDR is consistent (~0.07) across values of the average number of cells per sample (Cuomo et al., Supplementary Figure 7b). Examining the distributions of cells per donor in Cuomo et al. and in this study (below), we find that they are comparable, so the simulation results are directly relevant to the present study. We therefore do not expect our eQTL mapping results to be enriched for false positives at the threshold of 5 cells. A more stringent threshold would reduce the number of samples available for eQTL mapping, with a substantial impact on discovery power and empirical FDR (50 vs. 87 donors in Cuomo et al., Supplementary Figure 7a). Accordingly, we aimed to maximize the number of donors available for analysis to bring the empirical FDR as close as possible to the nominal FDR.
Distributions of numbers of cells allocated for each sample in simulations with an average of 50 or 120 cells, and the distribution of cells per donor per cell type in this study. With the threshold of 5 cells per sample, all cell types included in the analysis had a minimum of 42 samples, 75% of cell types had at least 56 samples, and 50% had at least 76 samples.

Comparing this distribution with Supplementary Figure 7 in Cuomo et al. indicates that the threshold of 5 cells allows us to maximize detection power while keeping the FDR controlled.
We further evaluated the impact of the minimum cell number on the number of donors and cell types available for the eQTL analysis. As lung is a complex tissue with many distinct cell types, we sought to include as many cell types as is reasonable, to gain as full a picture as possible of the landscape of genetic regulation of gene expression in healthy and diseased lungs.
When we required at least 40 donors for a cell type to be included in eQTL mapping and downstream analyses, the threshold we used (>=5 cells) left 38 cell types available for eQTL mapping. With a threshold of 10, 20, 30, or 50 cells, 30, 25, 21, and 16 cell types are retained, respectively. With a threshold of at least 50 cells per donor per cell type, we would only be able to map eQTL in 30-50% of major cell types in the lung. Using a threshold of at least 5 cells per donor allows us to map eQTL for 38 cell types, giving almost complete coverage of the major lung cell types.
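The threshold sweep described above amounts to a simple count, sketched below under stated assumptions: the input is a hypothetical mapping from cell type to per-donor cell counts, and the toy values are invented for illustration and are not data from this study.

```python
MIN_DONORS = 40  # donors required for a cell type to enter eQTL mapping (from the text)


def retained_cell_types(cell_counts, min_cells, min_donors=MIN_DONORS):
    """Return the cell types with at least `min_donors` donors passing `min_cells`.

    cell_counts: {cell_type: {donor: n_cells}} (hypothetical layout).
    """
    kept = []
    for cell_type, per_donor in cell_counts.items():
        # Count donors whose cell number for this cell type meets the threshold.
        n_donors = sum(1 for n in per_donor.values() if n >= min_cells)
        if n_donors >= min_donors:
            kept.append(cell_type)
    return sorted(kept)


# Toy counts (not study data): three cell types of decreasing abundance.
toy_counts = {
    "abundant": {f"D{i}": 200 for i in range(60)},
    "moderate": {f"D{i}": 12 for i in range(45)},
    "rare": {f"D{i}": 6 for i in range(41)},
}

# Number of cell types retained at each candidate threshold.
sweep = {t: len(retained_cell_types(toy_counts, t)) for t in (5, 10, 20, 30, 50)}
```

In this toy example the rarer cell types drop out as the threshold rises, mirroring the pattern reported above for the lung data.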
Finally, because the mashr joint modeling approach requires a complete eQTL matrix (i.e., no missing estimates for any eQTL in any cell type), eQTL mapping is carried out on some genes in certain cell types where there is relatively low power to detect eQTL effects. By modeling covariance and effect sharing between cell types, mashr itself substantially improves eQTL detection power, even in underpowered cell types. Further, after applying mashr, a stricter inclusion filter is applied: mashr-adjusted eQTL effects are reported only for genes meeting the expression criteria for the given cell type. On balance, the use of mashr supports a minimum threshold of 5 cells, which includes as many cell types in the analysis as possible, while mitigating issues that might otherwise arise in eQTL mapping from cell types with smaller sample sizes and therefore lower power. Our ability to control the FDR is demonstrated by the permutations (Supplementary Figures 7-8).
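The post-mashr reporting filter described above can be sketched as a mask over the complete effect matrix. This is an illustration only: the data layout, the function name, and the toy values are hypothetical, the expression criteria are represented simply as a per-cell-type set of expressed genes, and no mashr call is shown.

```python
def report_effects(posterior_effects, expressed_genes):
    """Keep adjusted effects only where the gene passes the cell type's expression filter.

    posterior_effects: {(gene, cell_type): effect}, complete over all gene/cell-type
        pairs, as joint modeling requires (hypothetical layout).
    expressed_genes: {cell_type: set of genes meeting the expression criteria}.
    """
    return {
        (gene, cell_type): effect
        for (gene, cell_type), effect in posterior_effects.items()
        if gene in expressed_genes.get(cell_type, set())
    }


# Toy values (not study data): GENE_B does not pass the filter in cell type "AT1".
toy_effects = {
    ("GENE_A", "AT1"): 0.40, ("GENE_A", "AT2"): 0.35,
    ("GENE_B", "AT1"): 0.10, ("GENE_B", "AT2"): 0.50,
}
toy_expressed = {"AT1": {"GENE_A"}, "AT2": {"GENE_A", "GENE_B"}}
reported = report_effects(toy_effects, toy_expressed)
```

The full matrix is still estimated for every gene and cell type; the mask only restricts which adjusted effects are reported.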