Main

Genome-wide association studies (GWASs) have successfully identified thousands of disease-associated variants1,2,3, but the cellular mechanisms through which these variants drive complex diseases and traits remain largely unknown. This is due to several challenges, including the difficulty of relating the approximately 95% of risk variants that reside in noncoding regulatory regions to the genes they regulate4,5,6,7 and our limited knowledge of the specific cells and functional programs in which these genes are active8. Previous studies have linked traits to functional elements9,10,11,12,13,14,15 and to cell types using bulk RNA-sequencing (RNA-seq) profiles16,17,18. Considerable work remains to analyze cell types and states at finer resolutions across a breadth of tissues, incorporate disease tissue-specific gene expression patterns, model cellular processes within and across cell types and leverage enhancer–gene links19,20,21,22,23 to improve power.

Single-cell RNA-seq (scRNA-seq) data provide a unique opportunity to tackle these challenges24. Single-cell profiles allow the construction of multiple gene programs to more finely relate GWAS variants to function, including programs that reflect cell-type-specific signatures25,26,27,28, disease-dependent signatures within cell types29,30 and key cellular processes that vary within and/or across cell types31. Initial studies have related single-cell profiles with human genetics in post-hoc analyses by mapping candidate genes from disease-associated genomic regions to cell types by their expression relative to other cell types32,33,34. More recent studies have begun to leverage genome-wide polygenic signals to map traits to cell types from single cells within the context of a single tissue35,36,37. However, focusing on a single tissue could, in principle, result in misleading conclusions, because disease mechanisms span tissue types across the human body. For example, in the context of the colon, a neural gene associated with psychiatric disorders would appear highly specific to enteric neurons, but this cell population may no longer be strongly implicated when the analysis also includes cells from the human central nervous system38. Thus, there is a need for a principled method that combines human genetics and comprehensive scRNA-seq applied across multiple tissues and organs.

In the present study, we develop and apply sc-linker, an integrated framework to relate human disease and complex traits to cell types and cellular processes by integrating GWAS summary statistics, epigenomics and scRNA-seq data from multiple tissue types, diseases, individuals and cells. Unlike previous studies, we analyze gene programs that represent different functional facets of cells, including discrete cell types, processes activated specifically in a cell type in disease and processes activated across cells irrespective of cell-type definitions (recovered by latent factor models). We transform gene programs to SNP annotations using tissue-specific enhancer–gene links19,20,21,22,23 in preference to standard gene window-based linking strategies used in existing gene-set enrichment methods such as MAGMA39, RSS-E13 and linkage disequilibrium score regression (LDSC)-specifically expressed genes18. We then link SNP annotations to diseases by applying stratified LDSC11 (S-LDSC) using the baseline-LD model40,41 to the resulting SNP annotations. We further integrate cellular expression and GWAS to prioritize specific genes in the context of disease-critical gene programs, thus providing new insights into underlying disease mechanisms.

Results

We developed a framework to link gene programs derived from scRNA-seq with diseases and complex traits (Fig. 1a). First, we use scRNA-seq to construct gene programs, defined as continuous-valued gene sets, that characterize (1) individual cell types, (2) disease-dependent (disease versus healthy cells of the same type) or (3) cellular processes (cell cycling, endoplasmic reticulum stress). (The continuous values are on the probabilistic 0–1 scale but do not formally represent probabilities (Methods).) Then, we link the genes underlying these programs to SNPs that regulate them by incorporating two tissue-specific, enhancer–gene-linking strategies: Roadmap Enhancer-Gene Linking19,20,21 and the Activity-by-Contact (ABC) model22,23. Finally, we evaluate the disease informativeness of the resulting SNP annotations by applying S-LDSC11 conditional on a broad set of coding, conserved, regulatory and LD-related annotations from the baseline-LD model40,41. Altogether, our approach links diseases and traits with gene programs recapitulating cell types and cellular processes. We have released open-source software implementing the approach (sc-linker; see Code availability), a web interface for visualizing the results (Data availability) and postprocessed scRNA-seq data, gene programs, enhancer–gene-linking strategies and SNP annotations analyzed in the present study (Data availability). A more comprehensive overview is provided in Supplementary Note.

We analyzed a broad range of human scRNA-seq data, spanning 17 datasets from 11 tissues and 6 disease conditions. The 11 nondisease tissues included blood/immune (peripheral blood mononuclear cells (PBMCs)26,42, cord blood27 and bone marrow27), brain28, kidney43, liver44, heart25, lung29, colon34, skin45 and adipose tissue44. The six disease conditions included multiple sclerosis (MS, brain)46, Alzheimer’s disease (AD, brain)30, ulcerative colitis (UC, colon)34, asthma, lung47, idiopathic pulmonary fibrosis (IPF), lung29 and COVID-19, bronchoalveolar lavage fluid (BAL)48 (Extended Data Fig. 1). In total, the scRNA-seq data included 209 individuals, 1,602,614 cells and 256 annotated cell subsets (Methods and Supplementary Table 1). We also compiled publicly available GWAS summary statistics for 60 unique diseases and complex traits (genetic correlation <0.9; average N = 297,000) (Methods and Supplementary Table 2). We analyzed gene programs from each scRNA-seq dataset together with each of 60 diseases and complex traits, but we primarily reported those that are most pertinent for each program.

As a proof of principle, we benchmarked the sc-linker by analyzing five blood cell traits that biologically correspond to specific immune cell types (Supplementary Table 2) using immune cell-type programs constructed from scRNA-seq data (Fig. 2a,b and Extended Data Fig. 1). We constructed six immune cell-type programs that were identified across four datasets: two from PBMCs (k = 4,640 cells, n = 2 individuals26; k = 68,551, n = 8 (ref. 42)) and one each of cord blood27 (k = 263,828, n = 8) and bone marrow27 (k = 283,894, n = 8). We identified enrichment of erythroid cells for red blood cell count, megakaryocytes for platelet count, monocytes for monocyte count and B cells and T cells for lymphocyte percentage (Fig. 2d and Extended Data Fig. 2a); these enrichments reflect known biological roles and have been reported in previous studies49,50, such that we refer to them as expected enrichments.

We defined a sensitivity/specificity index quantifying the presence of expected enrichments and absence of other enrichments (Methods). A limitation of this index is that other enrichments may be biologically real in some cases; thus, we also consider sensitivity to detect expected enrichments (Methods). The sc-linker outperformed the MAGMA39 gene-set-level association method in terms of the sensitivity/specificity index (Fig. 2c). Benchmarks on the sc-linker method, the choice of enhancer–gene-linking strategies and cell-type programs are included in Supplementary Note.

Distinguishing the cells involved in immune-related diseases

We next analyzed eleven autoimmune diseases (Supplementary Table 2) using the six immune cell-type programs above (Fig. 2a,b and Extended Data Fig. 1) and ten (intracell and intercell types) immune cellular process programs (Fig. 2f). (Enrichment results for the remaining 49 diseases and traits with immune cell-type programs are reported in Extended Data Fig. 3; we did not construct disease-dependent programs, because these datasets included healthy samples only.) We identified cell-type-disease enrichments that conform to known disease biology (Fig. 2e and Extended Data Fig. 2b), including T cells for eczema51,52, B and T cells for primary biliary cirrhosis (PBC)18 and dendritic cells (DCs) and monocytes for AD53. In addition, the highly statistically significant enrichments for MS across all six immune cell-type programs analyzed are consistent with previous analyses18,54,55,56, supporting the validity of our approach.

Several of the significant cell-type-disease enrichments have limited literature support and may implicate previously unexplored biological mechanisms (Fig. 2e, Table 1 and Extended Data Fig. 2b). For example, we detected significant enrichment in B cells for UC; B cells have been detected in basal lymphoid aggregates in the UC in the colon, but their pathogenic significance remains unknown57. In addition, T cells were highly enriched for celiac disease, the top driving genes including ETS1 (ranked 1), associated with T cell development and interleukin (IL)-2 signaling58, and CD28 (ranked 3), critical for T cell activation. This suggests that aberrant T cell maintenance and activation may impact inflammation in celiac disease. Recent reports of a permanent loss of resident γδ T cells in the celiac bowel and the subsequent recruitment of inflammatory T cells may further support this hypothesis59. These results were recapitulated across an independent immune cell scRNA-seq dataset, in both the gene programs (average correlation: 0.78 for the same cell type) and the disease enrichments (0.86 correlation of the E-score over all cell-type and -trait pairs). A cross-trait analysis of the patterns of cell-type enrichments suggests that celiac disease and rheumatoid arthritis involve cell-mediated adaptive immune response, UC and PBC involve antibody-mediated adaptive immune response, AD has a strong signal of innate immune and MS and inflammatory bowel disease (IBD) involve contributions from a wide range of immune cell types (Extended Data Fig. 4).

Analyzing the ten immune cellular process programs (Fig. 2f) across the eleven immune-related diseases and five blood cell traits, we identified both disease-specific enrichments and others that shared across diseases (Fig. 2g and Table 1). For example, although T cells have been previously linked to eczema, we pinpointed higher enrichment in CD4+ T cells compared with CD8+ T cells. The IL-2 signaling cellular process program in T and B cells was significantly enriched for both eczema and celiac disease, although the genes driving the enrichment were not significantly overlapping (P = 0.21). In addition, the complement cascade cellular process program in plasma, B cells and hematopoietic stem cells was most highly enriched among all intercellular programs for celiac disease. For AD, there was a strong enrichment in both classic and nonclassic, monocyte intracell-type cellular programs, and in major histocompatibility complex class II (MHC-II) antigen presentation (intercell type: dendritic cells (DCs) and B cells) and prostaglandin biosynthesis (intercell type: monocytes, DCs, B cells and T cells) programs. Among the notable driver genes were IL7R (ranked 1) and NDFIP1 (ranked 3) for CD4+ T cells in eczema, which respectively play key roles in helper T cell 2 differentiation60,61 and in mediating peripheral CD4 T cell tolerance and allergic reactions62,63, and CD33 (ranked 1) in MHC-II antigen processing in AD, a microglial receptor strongly associated with increased risk in previous GWASs64,65.

Linking GABA-ergic and glutamatergic neurons to psychiatric disease

We next focused on brain cells and psychiatric disease, by analyzing 9 cell-type programs (Fig. 3a) and 12 cell process programs (Fig. 3e; 10 intra- and 2 intercell-type programs) from scRNA-seq data of healthy brain prefrontal cortex (k = 73,191, n = 10)28 (Supplementary Table 1) with 11 psychiatric or neurological diseases and traits (Supplementary Table 2).

Notably, we observed enrichments of major depressive disorder (MDD) and body mass index (BMI) specifically in γ-aminobutyric acid (GABA)-ergic neurons, whereas insomnia, schizophrenia and intelligence were highly enriched, specifically in glutamatergic neurons, and neuroticism was highly enriched in both. GABA-ergic neurons regulate the brain’s ability to control stress levels, which is the most prominent vulnerability factor in MDD66 (Fig. 3b,c, Table 1 and Extended Data Fig. 2c). Among the top genes driving this enrichment were TCF4 (ranked 1), a critical component for neuronal differentiation that affects neuronal migration patterns67,68, and PCLO (ranked 4), which is important for synaptic vesicle trafficking and neurotransmitter release69. Although predominant therapies for MDD target monoamine neurotransmitters, especially serotonin, the enrichment for GABA-ergic neurons is independent of serotonin pathways, suggesting that they might include other therapeutic targets for MDD. These results were robustly detected in an independent brain scRNA-seq dataset, in both the gene programs (average correlation: 0.77 for the same cell type and −0.21 otherwise) and the disease enrichments (0.77 correlation of the E-score over all cell-type and -trait pairs), including GABA-ergic neurons in MDD and BMI as well as glutamatergic neurons in insomnia and schizophrenia. Enrichment results for the remaining 49 diseases and traits together with brain cell-type programs are reported in Extended Data Fig. 3.

Tissue specificity of both the cell-type program and the enhancer–gene strategy was important for successful linking, which we found by comparing the enrichment of all four possible combinations of immune or brain cell-type programs with immune- or brain-specific enhancer–gene-linking strategies, meta-analyzed across 11 immune-related diseases or 11 psychiatric/neurological diseases and traits (Fig. 3d). This highlights the importance of leveraging the tissue specificity of enhancer–gene strategies.

The 12 brain cellular process programs showed that the significant enrichment of brain-related diseases in the neuronal cell types above is primarily driven by finer programs reflecting neuron subtypes (Fig. 3f, Table 1 and Supplementary Note). For example, the enrichment of GABA-ergic neurons for BMI was driven by programs reflecting LAMP5+ and VIP+ cell subsets with higher expression of LAMP5 and VIP, respectively. Furthermore, the enrichment of GABA-ergic neurons for MDD reflects SST+ and PVALB+ cell subsets with higher expression of SST and PVALB, respectively. We also observed enrichment in more specific cell subsets within glutamatergic neurons (for example, inferior temporal (IT) neurons were enriched for neuroticism).

Linking cell types from diverse human tissues to disease

Analysis of kidney, liver, heart, skin and adipose tissuse cell types (Supplementary Table 1) and corresponding relevant traits (Supplementary Table 2) revealed the role of particular immune, stromal and epithelial cellular compartments across different diseases/traits. For example, kidney and liver cell-type programs (Extended Data Fig. 1) highlighted relations with urine biomarker traits (Fig. 4a and Extended Data Figs. 3 and 5a,b), such as enrichment for creatinine level in kidney proximal and connecting tubule cell types, but not in liver cell types, as expected70,71, or a significant enrichment for bilirubin level only in liver hepatocytes (driven by ANGPTL3; ranked 4)72,73. In heart (Fig. 4b, Extended Data Figs. 3 and 5c and Table 1), atrial cardiomyocytes were enriched for atrial fibrillation, and pericytes and smooth muscle cells for blood pressure, consistent with their respective roles in determining heart rhythm through activity74 of ion channels (top genes included the ion channel genes PKD2L2 (ranked 2), CASQ2 (ranked 7) and KCNN2 (ranked 18)) and blood pressure regulation through vascular tone75 (top driving genes included adrenergic pathway genes PLCE1 (ranked 1), CACNA1C (ranked 21) and PDE8A (ranked 23)). In skin (Fig. 4c, Extended Data Fig. 3 and Table 1), both brain-derived neurotrophic factor signaling and Langerhans’ cells were enriched for eczema. Langerhans’ cells have been implicated in inflammatory skin processes related to eczema76 (top driving genes included IL-2-signaling pathway genes (FCER1G (ranked 3), NR4A2 (ranked 26) and CD52 (ranked 43)), which modulate eczema pathogenesis77). In adipose (Fig. 4d and Extended Data Figs. 3 and 5e), adipocytes were enriched for BMI, driven by adipogenesis pathway genes78 (STAT5A (ranked 15), EBF1 (ranked 29), LIPE (ranked 45)) and triglyceride biosynthesis genes78 (GPAM (ranked 14), LIPE (ranked 45), both of which contribute to the increase in adipose tissue mass in obesity79,80).

We expanded our analysis to evaluate all cell-type programs for all diseases, irrespective of the tissue locus of disease, aiming to identify cell-type enrichments involving ‘mismatched’ cell-type disease/trait pairs (Supplementary Fig. 5). As expected, in most cases, ‘mismatched’ cell-type programs and disease/trait pairs do not yield significant association. Notable exceptions included enrichments of skin Langerhans’ cells for AD (E-score: 15.2, P = 10−4), M cells (in colon) for asthma (E-score: 2.2, P = 10−4) and heart smooth muscle cells for lung capacity (E-score: 5.6, P = 3 × 10−4). In some cases, the association may indicate a direct relationship, whereas in others the associated cell type may only ‘tag’ the causal cell type in the disease tissue, as cell-type programs derived from cells of the same type across tissues were found to be highly correlated (Fig. 4e), with consistent enrichment in these correlated cell-type programs (Extended Data Fig. 3 and Supplementary Note).

We next turned to cases where both healthy and disease tissue have been profiled, allowing us to link disease GWASs to programs associated with disease-specific biology. Such understanding is especially important for identifying therapeutic targets associated with disease development rather than disease-onset mechanisms.

We first examined disease-dependent programs in MS and AD, where aberrant interactions between neurons and immune cells are thought to play an important role. We analyzed MS and AD GWAS data (Supplementary Table 2) along with cell-type, disease-dependent and cellular process programs from scRNA-seq of brains of healthy and MS46 or AD30 individuals (Fig. 5a,e and Supplementary Table 1). We considered brain enhancer–gene links, immune enhancer–gene links (because MS and AD are associated with both tissue types) and nontissue-specific enhancer–gene links (Extended Data Fig. 6) and detected the strongest enrichment results for the immune enhancer–gene links. In both MS and AD, disease-dependent programs in each cell type differed substantially from cell-type programs constructed from cells from healthy (average Pearson’s r = 0.16) or disease (average Pearson’s r = 0.29) samples alone (Extended Data Fig. 7). Furthermore, we confirmed that disease GWASs matched to the corresponding disease-dependent programs produced the strongest enrichments, although there was substantial cross-disease enrichment (Extended Data Figure 8).

In MS, there was enrichment in disease-dependent programs in GABA-ergic neurons and microglia (Fig. 5b and Extended Data Fig. 9), as well as in layer 2 and 3 glutamatergic neurons and the complement cascade (in multiple cell types; Fig. 5d). The specific enrichment of the GABA-ergic neuron, disease-dependent program (but not the healthy cell-type program) for MS is consistent with the observation that inflammation inhibits GABA transmission in MS81. The GABA-ergic disease-dependent program was enriched with hydrogen ion transmembrane transporter activity genes, whereas the GABA-ergic cell-type program was enriched in genes with general neuronal functions (Supplementary Data 10). The enrichment of the microglia disease-dependent program for MS is consistent with the role of microglia in inflammation and demyelination in MS lesions82,83 and highlights a contribution of microglia to both disease onset and response. The top driving genes for the microglia disease-dependent program enrichment included MERTK (ranked 2) and TREM2 (ranked 4), both having roles in myelin destruction in MS patients84,85. Supporting this finding, there is a significant increase in the number of microglia (P = 2 × 10−4, Fisher’s exact test) and a significant decrease in number of glutamatergic neurons (P = 8 × 10−5) in MS lesions (Fig. 5c and Supplementary Data 11).

In AD, all associations highlighted the central role of microglia, suggesting that different processes may be at play in microglia or microglia subsets in healthy brains and after disease initiation: only the microglia disease-dependent program was enriched out of eight disease-dependent programs tested (Fig. 5e,f and Extended Data Fig. 10), along with the healthy microglia program and the apelin signaling pathway, disease-specific cellular process program (intercell type: GABA-ergic neurons and microglia). The microglia program enrichments are consistent with the contribution of microglia-mediated inflammation to AD progression86. Supporting this finding, there is a significant increase in the number of microglia in AD, brain (Fig. 5g and Supplementary Data 11).

Thus, in both MS and AD, heritability was enriched in distinct ways in microglia cell-type, disease-dependent and cellular process programs, suggesting therapeutic opportunities to combat the role of microglia in varying contexts for disease risk.

Linking enterocytes and M cells to UC

We next examined the role of cell-type, disease-dependent and cellular process programs in UC, where failure to maintain the colon’s epithelial barrier results in chronic inflammation. We analyzed UC and IBD GWAS data (Supplementary Table 2) with healthy cell-type, UC disease-dependent and UC cellular process programs constructed from scRNA-seq of healthy colon and from matched uninflamed and inflamed colon of UC patients (Fig. 6a and Supplementary Table 1). We compared colon enhancer–gene links (Fig. 6) and nontissue-specific enhancer–gene links (Extended Data Fig. 6) and detected the strongest enrichment results for the colon enhancer–gene links. As in MS and AD, UC disease-dependent programs in each cell type differed substantially from the corresponding healthy or disease colon cell-type programs (average Pearson’s r = 0.24; Extended Data Fig. 7 and Supplementary Data 12).

In addition to previously observed enrichments in healthy immune cell-type programs, our analysis highlighted healthy cell-type programs of enteroendocrine and endothelial cells, disease-dependent programs of enterocytes and M cells, as well as the complement cascade (in plasma, B cells, enterocytes and fibroblasts), MHC-II antigen presentation (macrophages, monocytes and DCs) and epidermal growth factor receptor 1 (EGFR-1) signaling (macrophages and enterocytes) in both healthy and disease cells (Fig. 6, Extended Data Fig. 3 and Supplementary Data 1). The strong enrichment in endothelial cells, which comprise the gut vascular barrier, is consistent with their rapid changes in UC87; the top driving genes included members of the tumor necrosis factor-α signaling pathway (EFNA1, NFKBIA and CD40, ranked 18, 26 and 29, respectively), a key pathway in UC88.

The disease-dependent programs (Fig. 6c, Table 1 and Extended Data Figs. 9 and 10) highlighted M cells, a rare cell type in healthy colon that increases in UC34 (Fig. 6d and Supplementary Data 11). M cells surveil the lumen for pathogens and play a key role in immune–microbiome homeostasis89. Supporting this finding, mutations in FERMT1, a top driving gene in the M-cell disease-dependent program (ranked 3), cause Kindler’s syndrome, a monogenic form of IBD with UC-like symptoms90. Notably, there was no enrichment in M-cell healthy cell-type programs (Fig. 6b), emphasizing that M cells are activated specifically in UC disease, as their proportions increase (P = 0.008; Fig. 6d).

Immune and connective tissue cell types linked to asthma

We analyzed GWAS data for asthma, idiopathic pulmonary fibrosis (IPF), COVID-19 (both general COVID-19 and severe COVID-19) and lung capacity (Supplementary Table 2) with healthy cell-type, disease-dependent and cellular process programs from asthma, IPF, COVID-19 and healthy29 (lower lung lobes) tissue scRNA-seq (Fig. 7a,c,f, Supplementary Figs. 13d–f and 15 and Supplementary Data 12), using either lung enhancer or immune enhancer–gene links. For asthma, there was significant enrichment for healthy cell-type and disease-dependent programs in T cells (Supplementary Note). For lung capacity (height-adjusted forced expiratory volume in 1 s (FEV1adj), relaxed vital capacity (RVC)), there was significant enrichment for healthy cell-type and disease-dependent programs in fibroblasts (Fig. 7b and Supplementary Data 1) and the MAPK cellular process program (in basal, club, fibroblast and endothelial cells) (Fig. 7f,g and Table 1). Genes driving these enrichments and enrichment results for IPF and COVID-19 are detailed in Supplementary Note.

Discussion

Previous work on identifying disease-critical tissues and cell types by combining expression profiles and human genetics signals has largely focused on the direct mapping of the expression of individual genes34 and genome-wide polygenic signals18,36 to discrete cell categories. Our study demonstrates that there is much to be gained by linking inferred representations of the underlying biological processes beyond cell types in different cell and tissue contexts with genome-wide polygenic disease signals, by integrating scRNA-seq, epigenomic and GWAS datasets.

Our work introduces three main conceptual advances: first, by integrating scRNA-seq data and GWAS summary statistics using tissue-specific enhancer–gene-linking strategies, we detect subtle differences in SNP-to-gene mapping between tissues which, on aggregation over the full GWAS signal, produce strong differences in disease heritability across cell types. Second, by constructing disease-dependent programs comparing cells of the same type in disease versus healthy tissue, we project GWAS signals across disease-specific cell states. Third, by using non-negative matrix factorization (NMF) to construct cellular process programs that do not rely on known cell-type categories, we identify cellular mechanisms that vary across a continuum of cells of one type or are shared between cells of different types, such as the mitogen‑activated protein kinase (MAPK) signaling pathway identified in the lung.

Leveraging these advances, we identified notable enrichments (Table 1) that have not previously been identified using GWAS data and are biologically plausible but not clearly expected, thus supporting the potential of the sc-linker to identify new knowledge. We also observed patterns across datasets that offer additional insights. For example, we observed that disease-dependent programs, but not healthy cell-type programs, of epithelial cells (M cells and basal cells) tend to be enriched in autoimmune diseases (UC and asthma). In contrast, for immune cells, healthy and disease-dependent programs tended to be similarly enriched. We posit that this suggests a role for epithelial cells in development, rather than initiation, of disease. Future studies are required to experimentally validate these hypotheses.

Our work has several limitations that highlight directions for future research. First, the cell types and states covered in this work are not exhaustive, and there will continue to be other cell types and more granular cell states uncovered as the scale of sequencing continues to grow. Second, the enhancer–gene-linking strategies can continue to be improved beyond the Roadmap and ABC models incorporated here. Finally, we focus on genome-wide disease heritability (rather than a particular locus); however, our approach can be used to implicate specific genes and gene programs. Additional limitations are discussed in Supplementary Note.

Looking forward, the gene program–disease links identified by our analyses can be used to guide downstream studies, including designing systematic perturbation experiments91,92 in cell and animal models for functional follow-up. In the long term, with the increasing success of phenome-wide association studies and the integration of multimodal single-cell resolution epigenomics, this framework will continue to be useful in identifying biological mechanisms driving a broad range of diseases.

Methods

Ethical approval

This research complies with all relevant ethical regulations, and the research protocols are approved by the Harvard School of Public Health.

ScRNA-seq data pre-processing

All scRNA-seq datasets in the present study25,26,27,28,29,30,34,42,43,44,45,46,47,48 are publicly available cell-by-gene expression matrices that are aligned to the hg38 human transcriptome (Supplementary Table 1). Each dataset included metadata information for each cell, describing the total number of reads in the cell and which sample the cell corresponds to and, if applicable, its disease status. We transformed each expression matrix to a count matrix by reversing any log(normalization) processing (because each downloaded dataset contained (1) raw counts, (2) normalized log2(TP10K) or (3) normalized log10(TP10K), where TP10K is transcripts per 10,000 transcripts) and standardized the normalization approach across all datasets to account for differences in sequencing depth across cells by normalizing the total number of unique molecular identifiers (UMIs) per cell, converting to TP10K and taking the log of the result to obtain log(10,000 × UMIs/total UMIs + 1), with ‘log2(TP10K + 1)’ as the final expression unit.

Dimensionality reduction, batch correction, clustering and annotation of scRNA-seq

The log2(TP10K + 1) expression matrix for each dataset was used for the following downstream analyses. For each dataset, we identified the top 2,000 highly variable genes across the entire dataset using Scanpy’s93 v.1.7.1 highly_variable_genes function with the sample ID as input for the batch. We then performed a principal component analysis (PCA) with the top 2,000 highly variable genes and identified the top 40 principal components (PCs), beyond which negligible additional variance was explained in the data (the analysis was performed with 30, 40 and 50 PCs and was robust to this choice). We used Harmony94 v.0.1.1 for batch correction, where each sample was considered its own batch. Subsequently, we built a k-nearest neighbors graph of cell profiles (k = 10) based on the top 40 batch-corrected components computed by Harmony and performed community detection on this neighborhood graph using the Leiden graph clustering method95 with resolution 1. For each dataset, individual single-cell profiles were visualized using the Uniform Manifold Approximation and Projection (UMAP)96. If previous annotations were available, they are used as a reference to annotate each cell in each dataset. If previous annotations were not available, we used established cell-type-specific expression signatures and gene markers described in the data source to annotate cells at the resolution of Leiden clusters.

Cell-type gene programs

We constructed cell-type programs for every cell type in a given tissue by applying a nonparametric Wilcoxon’s rank-sum test for differential expression (DE) between each cell type versus other cell types and computed a P value for each gene. Using a previously published strategy15, we transform these P values to X = −2log(P), which follows a $$\chi _2^2$$ distribution; these transformed values are converted to a grade between 0 and 1 using the minimum/maximum (min/max) normalization g = (X – min(X))/(max(X) – min(X)), resulting in a relative weighting of genes in each program. We note that these scores do not formally represent probabilities. In brief, cell-type programs constructed from healthy cells were termed healthy cell-type programs, and similarly cell-type programs constructed from disease cells were termed disease cell-type programs.

Disease-dependent gene programs

We constructed disease-dependent programs for each cell type observed in both healthy and matching disease tissue. For each cell type, we computed a gene-level, nonparametric, Wilcoxon’s rank-sum DE test between cells from healthy and those from disease tissues of the same cell type. The P values for each gene were transformed to a grade between 0 and 1 using the same strategy as in the cell-type program to form a relative weighting of genes in each program. In the COVID-19 BAL scRNA-seq, we also constructed viral progression programs based on DE between virally infected and uninfected cells of the same cell type in individuals with COVID-19. We observed low correlation between healthy cell-type gene programs and disease-dependent gene programs (Supplementary Fig. 13 and Supplementary Data 12).

Cellular process gene programs

Using latent factors derived from NMF97 (see below), we defined a cellular process program based on genes with high correlation (across cells) between their expression in each cell and the contribution of the factor to each cell (collapsing latent factors with high correlation). The correlations were transformed to a continuous-valued scale (between 0 and 1) by scaling their values (negative correlations are assigned to 0). We then annotated each factor (program) by the pathway most enriched in the top driving genes for the factor and labeled each as an ‘intracell type’ or ‘intercell type’ latent factor if the pathway was highly correlated with only one or multiple cell-type programs, respectively.

We constructed cellular process programs using an unsupervised approach, by applying NMF97 to the scRNA-seq cells-by-genes matrix. The solution to this formulation can be identified by solving the following minimization problem:

$$\begin{array}{l}{\mathrm{argmin}}\Bigg\{ \frac{1}{2}\left\| {X_{n,m} - \mathop {\sum}\limits_p {W_{\left\{ {n,p} \right\}}} \times H_{p,m}} \right\|_F^2 + (1 - \alpha )\frac{1}{2}\left\| {W_{n,p}} \right\| + \frac{1}{2}(1 - \alpha )\left\| {H_{p,m}} \right\|\\ + \alpha \left\| {{\mathrm{vec}}(W_{n,p})} \right\|_1 + \alpha \left\| {{\mathrm{vec}}(H_{p,m})} \right\|_1 \Bigg\}\end{array}$$
(1)

where Xn,m represents the log(normalized) expression of gene m in sample n, Wn,p denotes the grade of membership of latent factor p in sample n and Hp,m represents the factor weight of factor p in gene m. NMF identifies cellular processes as latent factors with a grade of contribution to each cell. For each dataset, we specified the number of latent factors p to be the number of annotated cell types in the dataset + 10. For each latent factor, we define a cellular process gene program by identifying genes with high correlation (across cells) between expression in a cell and the contribution of each factor to each cell. Latent factors with correlation >0.8 are collapsed to only consider a single latent factor. We annotated each cellular process program by the pathway most enriched (calculated with the Enrichr database and Fisher’s exact test P value) in the genes with highest correlation (across cells) between expression levels and factor weights (H) underlying the cellular process program (not necessarily the most highly expressed genes; Supplementary Fig. 17) and labeled it as an ‘intracell-type’ or ‘intercell-type’ cellular process program if highly correlated with only one or multiple cell-type programs, respectively.

Cellular process gene programs constructed from healthy and disease tissues

For scRNA-seq from healthy and disease tissue contexts, we proposed a modified NMF approach to construct gene programs that are shared across both tissues, specific to either healthy tissue or disease tissue. Let $$H_{P \times N_1}$$ be the observed gene expression data for a tissue T from a healthy individual and $$D_{P \times N_2}$$ be the observed gene expression data for the corresponding tissue from a disease individual. P is the number of features (genes), and N1 and N2 denote the number of samples from the healthy and disease tissues, respectively.

We assume an NMF for H and D as follows:

$$H_{P{{{\mathrm{\times}}}}N_1} \approx \left[ {L_{P{{{\mathrm{\times}}}}K_{\mathrm{C}}}^{{\mathrm{CH}}}L_{P{{{\mathrm{\times}}}}K_{\mathrm{H}}}^{{\mathrm{UH}}}} \right]F_{(K_{\mathrm{C}} + K_{\mathrm{H}}){{{\mathrm{\times}}}}N_1}^{\mathrm{H}}\,{\mathrm{where}}\,L^{{\mathrm{CH}}},L^{{\mathrm{UH}}},F^{\mathrm{H}} > 0$$
(2)
$$D_{P{{{\mathrm{\times}}}}N_2} \approx \left[ {L_{P{{{\mathrm{\times}}}}K_{\mathrm{C}}}^{{\mathrm{CD}}}L_{P{{{\mathrm{\times}}}}K_{\mathrm{D}}}^{{\mathrm{UD}}}} \right]F_{(K_{\mathrm{C}} + K_{\mathrm{D}}){{{\mathrm{\times}}}}N_2}^{\mathrm{D}}\,{\mathrm{where}}\,L^{{\mathrm{CD}}},L^{{\mathrm{UD}}},F^{\mathrm{D}} > 0$$
(3)

where KC is the number of shared programs between the healthy and the disease samples, KH is the number of healthy specific programs and KD is the number of disease-specific programs. LCH and LCD are used to denote the shared programs between healthy and disease states. Therefore, we assume that LCH is very close to LCD but not exact to account for other factors such as experimental conditions perturbing the estimates slightly. On the other hand, LUH and LUD are used to denote the healthy specific and disease-specific programs, respectively. FH and FD denote the program weights in the healthy and disease samples, respectively. This framed in the form of the following optimization problem:

$$\begin{array}{l}\mathop {{{{{\mathrm{argmin}}}}}}\limits_{L^{\mathrm{H}},L^{\mathrm{D}},F^{\mathrm{H}},F^{\mathrm{D}}} \frac{1}{2}\left\| {H - L^{\mathrm{H}}F^{\mathrm{H}}} \right\|_F^2 + \frac{1}{2}\left\| {D - L^{\mathrm{D}}F^{\mathrm{D}}} \right\|_F^2\\ + \frac{\mu }{2}\left( {\left\| {L^{\mathrm{H}}} \right\|_F^2 + \left\| {L^{\mathrm{D}}} \right\|_F^2} \right) + \frac{\gamma }{2}\left( {\left\| {L^{{\mathrm{CH}}}} \right\|_F^2 - \left\| {L^{{\mathrm{CD}}}} \right\|_F^2} \right)\end{array}$$
(4)

where $$L^{\mathrm{H}} = \left[ {L_{P{{{\mathrm{X}}}}K_{\mathrm{C}}}^{{\mathrm{CH}}}L_{P{{{\mathrm{X}}}}K_{\mathrm{H}}}^{{\mathrm{UH}}}} \right]\,{\mathrm{and}}\,L^{\mathrm{D}} = \left[ {L_{P{{{\mathrm{X}}}}K_{\mathrm{C}}}^{{\mathrm{CD}}}L_{P{{{\mathrm{X}}}}K_{\mathrm{D}}}^{{\mathrm{UD}}}} \right]{\mathrm{and}}\,\gamma$$ is a tuning parameter that controls how close LCH is to LCD and μ represents a tuning parameter that controls for the size of the loadings and the factors.

To determine the multiplicative updates of the NMF optimization problem in equation (4), we compute the derivatives of the optimization criterion with respect to each parameter of interest. We call the optimization criterion Q:

$$\nabla Q\left( {L^{\mathrm{H}}} \right) = - HF^{{\mathrm{H}}^T} + L^{\mathrm{H}}F^{\mathrm{H}}F^{{\mathrm{H}}^T} + \mu L^{\mathrm{H}} - \gamma [L^{{\mathrm{CD}}}0]$$
(5)
$$\nabla Q\left( {L^{\mathrm{D}}} \right) = - DF^{{\mathrm{D}}^T} + L^{\mathrm{D}}F^{\mathrm{D}}F^{{\mathrm{D}}^T} + \mu L^{\mathrm{D}} - \gamma [L^{{\mathrm{CH}}}0]$$
(6)
$$\nabla Q\left( {F^{\mathrm{H}}} \right) = - L^{{\mathrm{H}}^T}H + L^{{\mathrm{H}}^T}L^{\mathrm{H}}F^{\mathrm{H}}$$
(7)
$$\nabla Q\left( {F^{\mathrm{D}}} \right) = - L^{{\mathrm{D}}^T}D + L^{{\mathrm{D}}^T}L^{\mathrm{D}}F^{\mathrm{D}}$$
(8)

Following the multiplicative update rules of NMF as per Lee and Seung97, we get the following iterative updates and assume convergence has been achieved after 100 iterations or when the reconstruction error is below a user-specified error threshold (here the threshold is taken to be 1 × 10−4):

$$L_{ij}^{\mathrm{H}} \leftarrow L_{ij}^{\mathrm{H}}\frac{{\left( {HF^{{\mathrm{H}}^T} + \gamma \left[ {L^{{\mathrm{CD}}}0} \right]} \right)_{ij}}}{{\left( {L^{\mathrm{H}}F^{\mathrm{H}}F^{{\mathrm{H}}^T} + \mu L^{\mathrm{H}}} \right)_{ij}}}$$
(9)
$$L_{ij}^{\mathrm{D}} \leftarrow L_{ij}^{\mathrm{D}}\frac{{\left( {DF^{{\mathrm{D}}^T} + \gamma [L^{{\mathrm{CH}}}0]} \right)_{ij}}}{{\left( {L^{\mathrm{D}}F^{\mathrm{D}}F^{{\mathrm{D}}^T} + \mu L^{\mathrm{D}}} \right)_{ij}}}$$
(10)
$$F_{ij}^{\mathrm{H}} \leftarrow F_{ij}^{\mathrm{H}}\frac{{\left( {L^{{\mathrm{H}}^T}H} \right)_{ij}}}{{\left( {L^{{\mathrm{H}}^T}L^{\mathrm{H}}F^{\mathrm{H}}} \right)_{ij}}}$$
(11)
$$F_{ij}^{\mathrm{D}} \leftarrow F_{ij}^{\mathrm{D}}\frac{{\left( {L^{{\mathrm{D}}^T}D} \right)_{ij}}}{{\left( {L^{{\mathrm{D}}^T}L^{\mathrm{D}}F^{\mathrm{D}}} \right)_{ij}}}$$
(12)

Genomic annotations and the baseline-LD models

We define an annotation as an assignment of a numeric value to each SNP in a predefined reference panel (for example, 1000 Genomes Project98; see Data availability). Binary annotations can have a value of 0 or 1 only, continuous-valued annotations can have any real value and our focus is on continuous-valued annotations with values between 0 and 1. Annotations that correspond to known or predicted functions are referred to as functional annotations. The baseline-LD model40,41 (v.2.1) contains 86 functional annotations (Data availability), including binary coding, conserved and regulatory annotations (for example, promoter, enhancer, histone marks, transcription factor-binding site) and continuous-valued LD-related annotations.

S-LDSC

S-LDSC assesses the contribution of a genomic annotation to disease and complex trait heritability11. It assumes that the per-SNP heritability or variance of effect size (of standardized genotype on trait) of each SNP is equal to the linear contribution of each annotation.

$${\mathrm{var}}\left( {{{{\mathrm{{\ss}}}}}_j} \right) = \mathop {\sum}\limits_c^C {a_{jc}t_c}$$
(14)

where ajc is the value of annotation c at SNP j, with the annotation either continuous or binary (0/1), and tc is the contribution of annotation c to per-SNP heritability conditional on the other annotations. S-LDSC estimates tc for each annotation using the following equation:

$$E\left( {X_j^2} \right) = N\mathop {\sum}\limits_c l \left( {j,c} \right)t_c + 1$$
(15)

where $$l\left( {j,c} \right) = \mathop {\sum}\nolimits_k {a_{ck}r_{jk}^2}$$ is the stratified LD score of SNP j with respect to annotation c, rjk is the genotypic correlation between SNPs j and k computed using 1000 Genomes Project, and N is the GWAS sample size.

We assess the informativeness of an annotation c using two metrics. The first metric is the enrichment score (E-score), which relies on the enrichment of annotation c (Ec), defined for binary annotations as follows (for binary and continuous-valued annotations only):

$$E_c = \frac{{\frac{{h_g^2(c)}}{{h_g^2}}}}{{\frac{{\mathop {\sum}\nolimits_j {a_{jc}} }}{M}}}$$
(16)

where $$h_g^2\left( c \right)$$ is the heritability explained by the SNPs in annotation c, weighted by the annotation values where M is the total number of SNPs on which this heritability is computed (5,961,159 in our analyses). The E-score is defined as the difference between the enrichment for annotation c corresponding to a particular program against an SNP annotation for all protein-coding genes with a predicted enhancer–gene link in the relevant tissue. The E-score metric generalizes to continuous-valued annotations with values between 0 and 1 (ref. 102). We primarily focus on the P value for nonzero E-score >2. We chose the threshold of 2 because it is a round number that is roughly the geometric mean of the value of 1 (no enrichment) and the median value of 3.7 among the notable enrichments highlighted in Table 1.

The second metric is standardized effect size (τ*), the proportionate change in per-SNP heritability associated with a 1 s.d. in the value of the annotation, conditional on other annotations included in the model40:

$$\tau _c^ \ast = \frac{{\tau _c{\mathrm{sd}}_c}}{{h_g^2/M}}$$
(17)

where sdc is the s.e.m. of annotation c, $$h_g^2$$ is the total SNP heritability and M is as defined previously. $$\tau _c^ \ast$$ is the proportionate change in per-SNP heritability associated with an increase of 1 s.d. in the value of a annotation.

We assessed the statistical significance of the enrichment score and τ* via block-jackknife, as in previous work11, with significance thresholds determined via false discovery rate (FDR) correction (q-value < 0.05)103. The FDR was calculated over all relevant relatively independent traits for a tissue and all programs of a particular type (cell-type programs, disease-dependent programs, cellular process programs) derived from that tissue. We used the P value for nonzero enrichment score as our primary metric, because τ* is often nonsignificant for small cell-type-specific annotations when conditioned on the baseline-LD model104.

MAGMA gene-level and GSEAs

MAGMA assesses the enrichment of genes and gene sets with disease. MAGMA v.1.08 was run using a 0-kb window around each gene to link SNPs to genes, using all default MAGMA parameters for running the gene-level analysis, and using the 1000 Genomes reference panel for the genotype LD reference. For the gene-set-level analysis, two types of analysis were performed: (1) a binary gene-set analysis by thresholding the gene programs at different thresholds of program score (ranging from 0.2 to 0.95) (using the --set-annot flag in MAGMA) and (2) a continuous variable-based analysis by treating the gene program probabilistic grade or −log(odds) of the probabilistic grade as continuous gene-level variables (using the --gene-covar flag in MAGMA).

GWAS summary statistics

We analyzed publicly available GWAS summary statistics for 60 unique diseases and traits with genetic correlation <0.9. Each trait passed the filter of being well powered enough for heritability studies (z-score for observed heritability >5 as in previous work including Finucane et al.18). We used the summary statistics for SNPs with minor allele count >5 in a 1000 Genomes Project European reference panel98. The lung FEV1:forced vital capacity (FVC) trait was corrected for height data. For COVID-19, we analyzed two phenotypes: general COVID-19 (Covid versus population, liability scale heritability, h2 = 0.05, s.e.m. = 0.01) and severe COVID-19 (hospitalized Covid versus population, liability scale heritability, h2 = 0.03, s.e.m. = 0.01)105 (meta-analysis round 4, 20 October 2020: https://www.covid19hg.org/).

Computing a sensitivity/specificity index

We define a sensitivity/specificity index to benchmark (1) sc-linker versus MAGMA gene-set enrichment analysis (GSEA) and (2) different versions of the sc-linker corresponding to varying ways to define cell-type programs and SNP-to-gene linking strategies.

For the comparison of the sc-linker with MAGMA, we define the sensitivity/specificity index as the difference of (1) the average of −log10(P) of enrichment score (association) using the sc-linker (MAGMA) for ‘expected enrichments’ (gene program, trait) combinations (sensitivity) and (2) the average of −log10(P) of GSEA (association) using the sc-linker (MAGMA) for ‘other enrichments’ (gene program, trait) combinations (specificity). In Fig. 4e, the expected enrichment combinations include immune programs for blood cell traits and immune diseases, and brain programs for brain-related traits49,50; all other combinations are considered to be other enrichments. In Supplementary Fig. 8, the expected enrichment combinations include B and T cells for lymphocyte percentage, monocytes for monocyte percentage, megakaryocytes for platelet count, erythroid for red blood cell (RBC) counts and RBC distribution width; all other combinations of cell types and traits are considered to be other enrichments49,50. A limitation of the sensitivity/specificity index is that other enrichments may be biologically real in some cases; thus, we also consider sensitivity to detect expected enrichments.

For the comparison of the different versions of the sc-linker approach using either varying definitions of cell-type programs (Supplementary Figs. 6 and 7) or different ways to link SNPs to genes beyond Roadmap–ABC enhancer–gene-linking strategy (Fig. 3d,e and Supplementary Fig. 3), we use a slightly different definition of sensitivity/specificity index. Instead of the −log10(P) value, we use the τ* metric from the S-LDSC method, which evaluates conditional information in the SNP annotation corresponding to a gene program, corrected for the annotation size. This metric is preferred when comparing across cell-type programs or enhancer–gene-linking strategies that are widely different in their corresponding SNP annotation sizes, as is the case in these comparisons (we note that use of this metric is not possible in comparisons involving MAGMA, which does not estimate τ*).

Identifying genes driving heritability enrichment

For each gene program, we first subset the full gene list to only consider genes with >80% probability grade of membership in the gene program. Subsequently, we ranked all remaining genes using MAGMA (v.1.08) gene-level significance score and considered the top 50 ranked genes for further downstream analysis, which is different from the top 200 genes used for a ‘baseline’ method for scoring cell-type enrichments for disease that we used as a benchmark for sc-linker.

Identifying statistically significant differences in cell-type proportions

To identify changes in cell-type proportions between healthy and disease tissue, we used a multinomial regression test to jointly test changes across all cell types simultaneously. This helps account for all cell-type changes simultaneously, because an increase in the number of cells of one cell type implies that fewer cells of the other cell type will be captured. This regression model and the associated P values were calculated using the multinom function in the nnet v.7.3–17R package.

Statistics and reproducibility

All data used in the present study were generated and designed by the original studies in which they appear. No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The Investigators were not blinded to allocation during experiments and outcome assessment. All sc-linker heritability enrichment and significance P values are computed using a one-sided S-LDSC test. Multiple hypothesis correction was performed at the level of each scRNA-seq dataset across all cell-type and disease pairs.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.