Date: 2022-06-06

SCAVENGE methodology

The framework of the SCAVENGE method is schematized in Fig. 1. In brief, SCAVENGE employs a network propagation strategy to explore transitive associations of a subset of cells that are highly relevant to traits of interest in the cell-to-cell network. The workflow is described in detail in the following steps.

Cell-to-cell similarity network construction

SCAVENGE uses a mutual k nearest neighbor (M-kNN) graph to faithfully represent inherent relationships of individual cells. We start from the feature-by-cell matrix from scATAC-seq profiles and use latent semantic indexing (LSI)59,60,61,62 to extract representative lower dimensions. Specifically, the binarized sparse matrix is first converted into a term frequency-inverse document frequency (TF-IDF) matrix by weighting the matrix against the total number of features for each cell with the following formula:

$$w_{i,j} = tf_{i,j} \times {{{\mathrm{log}}}}\left( {1 + \frac{N}{{df_i}}} \right),$$

where \(w_{i,j}\) is the weight for feature i in cell j; \(tf_{i,j}\) is the term frequency, that is, the number of occurrences of feature i in cell j; \(df_i\) is the document frequency of feature i, that is, the number of cells in which feature i appears; and N is the total number of cells in the experiment.
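The weighting above can be sketched in a few lines of plain Python. This is a minimal illustration, not the SCAVENGE implementation: the function name `tfidf` is hypothetical, and dividing each binary entry by the cell's total number of accessible features to obtain \(tf_{i,j}\) is an assumption based on the phrase "weighting the matrix against the total number of features for each cell".

```python
import math


def tfidf(binary_matrix):
    """TF-IDF weights for a binarized feature-by-cell matrix (list of rows).

    Assumption: tf is the binary entry divided by the cell's total number
    of accessible features; idf is log(1 + N / df_i) as in the formula above.
    """
    n_features = len(binary_matrix)
    n_cells = len(binary_matrix[0])
    # Number of accessible features in each cell (column sums).
    cell_totals = [
        sum(binary_matrix[i][j] for i in range(n_features)) for j in range(n_cells)
    ]
    weights = []
    for i in range(n_features):
        df_i = sum(binary_matrix[i])  # number of cells in which feature i appears
        idf = math.log(1 + n_cells / df_i) if df_i > 0 else 0.0
        row = []
        for j in range(n_cells):
            tf = binary_matrix[i][j] / cell_totals[j] if cell_totals[j] > 0 else 0.0
            row.append(tf * idf)
        weights.append(row)
    return weights
```

For a toy matrix with two features and two cells, `tfidf([[1, 1], [1, 0]])` gives the feature present in both cells a lower inverse document frequency than the feature present in only one.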

The singular value decomposition (SVD) is applied on the TF-IDF matrix to generate an LSI score matrix with a lower-dimensional space as follows:

$$X = U_{m,k}{{\Sigma }}_{k,k}V_{n,k}^T.$$

This is the decomposition of X, where U and V are orthogonal matrices, and Σ is a diagonal matrix; m and n are the numbers of rows and columns of X, respectively. \(U = [\mu _1, \ldots ,\mu _k]\) contains the left singular vectors, each \(\mu _i\) having length m. \(V = [v_1, \ldots ,v_k]\) contains the right singular vectors, each \(v_i\) having length n. \({{\Sigma }}_{k,k} = diag\left( {\sigma _1, \ldots ,\sigma _k} \right)\), where \(\sigma _1 \ge \sigma _2 \ge \ldots \ge \sigma _k\) are the singular values of X.

Next, SCAVENGE builds a nearest neighbor graph from the LSI matrix of N cells and d leading LSIs (d = 30). The Euclidean distance between any pair of cells is calculated based on the LSI matrix, and k nearest neighbors (k = 30) for each cell are identified. To ensure that connected cells share the same phenotype or state, we construct an M-kNN graph in which two nodes (cells) are connected only if each is among the k nearest neighbors of the other18,63. If a cell has no mutual k nearest neighbors, we connect this cell to its nearest neighbor to guarantee the connectivity of the resulting graph. The M-kNN graph prioritizes the kNN structure and allows each cell to have, at most, k neighbors. This avoids the generation of extreme hubs that have a large number of neighbors and also ensures the sparsity of the resulting graph18,63.
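The graph construction can be illustrated with a brute-force sketch. The function name `mutual_knn_graph` and the list-of-sets adjacency representation are assumptions made for illustration. A real analysis of tens of thousands of cells would use an approximate nearest neighbor search rather than the all-pairs distances computed here.

```python
import math


def mutual_knn_graph(points, k):
    """Build an undirected mutual k-nearest-neighbor graph.

    points: list of coordinate tuples (for example, the leading LSI components).
    Returns a list of neighbor sets, one per cell.
    """
    n = len(points)
    # k nearest neighbors of each cell by Euclidean distance, excluding the cell itself.
    knn = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        knn.append(others[:k])
    knn_sets = [set(neighbors) for neighbors in knn]

    adjacency = [set() for _ in range(n)]
    # Keep an edge only when both cells list each other among their k nearest neighbors.
    for i in range(n):
        for j in knn[i]:
            if i in knn_sets[j]:
                adjacency[i].add(j)
                adjacency[j].add(i)
    # Fallback: a cell without any mutual neighbor is linked to its single nearest neighbor.
    for i in range(n):
        if not adjacency[i]:
            j = knn[i][0]
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency
```

In this sketch, the fallback step can push the degree of the receiving cell above k, and it removes isolated cells without checking that the whole graph forms a single connected component.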

Definition of seed cells

Given the sparsity of single-cell genomic data, only a few cells in the top ranks evaluated by the co-localization approaches can reliably reflect the relevance to a phenotype/trait/disease of interest (Extended Data Fig. 1b and Fig. 2a). We define the seed cells as a set of cells that are most likely to be relevant to the tested trait. To identify the seed cells for a specific trait of interest, we use g-chromVAR to calculate a bias-corrected Z-score (correcting for confounding technical factors such as GC content bias and PCR amplification) for each cell13, by integrating PPs of genetic causal variants and their strength of chromatin accessibility. Because seed cells and their number can vary among tested genetic traits and the single-cell dataset employed, it is not suitable to pre-define a fixed number of seed cells that is optimal for all situations. As the Z-score generated from g-chromVAR is a normalized measurement that treats all cells uniformly, we reasoned that it can serve as an initial filter for seed cells. We convert Z-scores to P values using a one-tailed normal distribution and initially consider all the cells with P values less than 0.05 as seed cells. Our analysis showed that SCAVENGE is robust to a range of proportions for seed cell selection (Extended Data Figs. 2h and 3c). In practice, 5% of total cells is sufficient to represent seed cells. Therefore, if the number of initial seed cells exceeds 5% of total cells, we refine the seed cells by keeping only the 5% of cells with the highest Z-scores.
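The seed selection rule can be sketched as follows. The function name `select_seed_cells` and its parameter names are hypothetical. The upper-tail P value is computed from the standard normal distribution with the complementary error function, and rounding the 5% cap to the nearest integer is an assumption.

```python
import math


def select_seed_cells(z_scores, p_cutoff=0.05, max_fraction=0.05):
    """Return indices of seed cells from per-cell bias-corrected Z-scores."""
    n = len(z_scores)
    # One-tailed (upper) P value under a standard normal: P(Z >= z).
    p_values = [0.5 * math.erfc(z / math.sqrt(2)) for z in z_scores]
    candidates = [i for i in range(n) if p_values[i] < p_cutoff]
    # Cap the seed set at a fixed fraction of all cells (assumption: round, at least 1).
    max_seeds = max(1, round(max_fraction * n))
    if len(candidates) > max_seeds:
        # Keep only the cells with the highest Z-scores.
        candidates = sorted(candidates, key=lambda i: z_scores[i], reverse=True)[:max_seeds]
    return sorted(candidates)
```

With 100 cells of which ten pass the P value filter, only the five with the highest Z-scores are kept. With a single passing cell, that cell alone is the seed set.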

Network propagation with seed cells

SCAVENGE relies on the concept of network propagation, which is based on the guilt-by-association principle where the proximity between the set of seed nodes and all the nodes in the graph can be comprehensively measured. We introduce random walk with restart (RWR)64, a network propagation-based algorithm for propagating the set of seed cells for a trait of interest to discover the transitive associations hidden in the cell-to-cell graph.

In general, a random walk over a graph is a stochastic process. That is, the initial state of the graph is known, and its state changes over iterations (random walk steps) according to a transition probability matrix that describes the probability of jumping from one node to another. The initial state of the graph is defined by selected seed nodes (cells). Their information propagates to all nodes in the graph, and the graph finally reaches a stationary state after a series of random walk steps. The strength of transitive associations can be measured by the information carried by each node at the stationary state of the graph. The stationary distribution is defined as the network propagation score. By leveraging the entire structure of the graph, the RWR algorithm measures how strongly a cell is influenced by seed cells, not only through its direct neighborhood but also through more distant neighborhoods that can be reached in multiple steps. Intuitively, the more a cell is influenced by the seed cells, the greater relevance it has to the phenotype/trait evaluated. As such, a higher network propagation score indicates stronger relevance to the trait evaluated.

More formally, there is a set of seed nodes \(I \subseteq V\) defining the initial state of an undirected M-kNN graph, \(G = \left( {V,E} \right)\), that is constructed as described above. Two sparse matrices are created to represent the graph structure:

$$A_{i,j} = \left\{ {\begin{array}{*{20}{l}} 1 \hfill & {if\;e_{i,j} \in E} \hfill \\ 0 \hfill & {otherwise} \hfill \end{array}} \right.$$

$$M_{i,j} = \frac{{A_{i,j}}}{{{{\Sigma }}_jA_{i,j}}},$$

where \(A_{i,j}\) denotes the adjacency matrix of G, and \(M_{i,j}\) is the transition probability matrix obtained by column normalization of \(A_{i,j}\).

The random walk step s is discrete and finite, \(s \in {\Bbb N}\). The information carried by node v at step s is \(v_s\). We consider the information to be equally distributed across the seed nodes in the initial state at step 0. We can write

$$v_0 = \left\{ {\begin{array}{*{20}{l}} {1/n(I)} \hfill & {if\;v \in I} \hfill \\ 0 \hfill & {otherwise} \hfill \end{array}} \right.,$$

where n(I) is the number of seed nodes.

At each iteration, a node can transfer the information to one of its randomly selected neighbors (the probability depends on the number of neighbors and is stored in \(M_{i,j}\)) or restart at the node by transferring information back to itself. As G is undirected, connected and not bipartite (that is, its nodes cannot be split into two disjoint non-empty sets such that every edge joins one set to the other), the random walk on the graph is irreducible and aperiodic, and the iterative update of this procedure is guaranteed to converge to the stationary steady state65. The corresponding stationary distribution or probability for each node can be obtained by recursively applying the following equation until convergence: \(\forall \;i,j \in V\), \(\forall s \in {\Bbb N},\)

$$v_{s + 1}^T = \left( {1 - \gamma } \right)Mv_s^T + \gamma v_0^T,$$

where γ is the restart probability, ranging from 0 to 1 (γ = 0.05). Technically, the restart probability serves as a damping factor on long walks and prevents the walk from being trapped in a dead end. It helps to ensure that the propagation process is confined to the local neighborhood and that random walks converge to the stationary state. The random walk process is continued until steady state (\(|v_{s + 1} - v_s| < \alpha\), where α is \(1 \times 10^{-5}\)), and the stationary distribution \(v_s\) is taken as the network propagation score.
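The update rule can be sketched as a simple power iteration over a neighbor-set representation of the graph. The function name `random_walk_with_restart` is hypothetical. The sketch makes three assumptions: every node spreads its information equally over its neighbors (a column-stochastic M), the convergence criterion uses the L1 norm of the change between iterations, and no node is isolated.

```python
def random_walk_with_restart(adjacency, seeds, gamma=0.05, tol=1e-5, max_iter=10000):
    """Network propagation scores for an undirected graph.

    adjacency: list of neighbor sets, one per node (no isolated nodes assumed).
    seeds: iterable of seed node indices.
    """
    n = len(adjacency)
    seed_set = set(seeds)
    # Initial state: information equally distributed across the seed nodes.
    v0 = [1.0 / len(seed_set) if i in seed_set else 0.0 for i in range(n)]
    v = list(v0)
    for _ in range(max_iter):
        # Restart term: gamma * v0.
        new_v = [gamma * v0[i] for i in range(n)]
        # Propagation term: each node passes (1 - gamma) of its information
        # to its neighbors in equal shares.
        for j in range(n):
            share = (1.0 - gamma) * v[j] / len(adjacency[j])
            for i in adjacency[j]:
                new_v[i] += share
        change = sum(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if change < tol:  # assumed L1 convergence criterion
            break
    return v
```

On a small graph made of a triangle with one pendant node, seeding one triangle node yields scores that sum to 1 and are highest near the seed relative to its symmetric counterpart.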

TRS normalization and scale

The total sum of information is kept constant and equals 1 throughout the graph, while information spreads over the graph with each iteration. Therefore, the original network propagation scores are essentially very small (for example, the score for a cell that is relevant to the trait could be as small as \(1 \times 10^{-4}\)), which makes these scores challenging to interpret (Extended Data Fig. 3b). Although the network propagation scores represent the trait-associated relevance, they are highly dependent on cell numbers in the evaluated dataset (that is, the number of nodes in the graph). This leads to another drawback: the network propagation scores cannot be directly compared across different datasets (which would have different cell numbers) even if the same genetic trait is assessed and corresponding enrichments are observed. At the same time, we cannot determine significance for each cell from the network propagation score. We reasoned that appropriate processing of network propagation scores is needed to make per-cell scores comparable across different traits while inheriting the overall significance levels from our g-chromVAR analysis. To this end, we define the TRS by scaling and normalizing the network propagation score. Specifically, we first calculate the 99th percentile of network propagation scores and use it as the ceiling. This step ensures that the few cells with the highest network propagation scores are put on the same level and avoids the effects of potential extreme outliers, so that network propagation scores can be scaled between 0 and 1 across all cells.

$$\overrightarrow {NP} _{ceiling} = {{{\mathrm{percentile}}}}(\overrightarrow {NP} ,0.99),$$

$$\overrightarrow {NP} _{scaled} = \frac{{\overrightarrow {NP} _{ceiling} - {{{\mathrm{min}}}}(\overrightarrow {NP} _{ceiling})}}{{{{{\mathrm{max}}}}(\overrightarrow {NP} _{ceiling}) - {{{\mathrm{min}}}}(\overrightarrow {NP} _{ceiling})}},$$

To match the final TRS to the significance level of the original dataset, a scaling factor representing the average bias-corrected Z-score of the top 1% of cells is applied as follows:

$$\overrightarrow {TRS} = \frac{{\overrightarrow {NP} _{scaled} \times \mathop {\sum }\nolimits_{i = 1}^m Z_i}}{m},$$

where m is the number of cells in the top 1% by bias-corrected Z-score, and \(Z_i\) is the bias-corrected Z-score of the i-th of these cells.
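One reading of the scaling steps is sketched below. The names `percentile` and `trait_relevance_scores` are hypothetical. The sketch assumes that scores above the 99th percentile are capped at that value before min-max scaling, that the percentile uses linear interpolation between order statistics, and that the top 1% is rounded to at least one cell.

```python
import math


def percentile(values, q):
    """q-th quantile (0 <= q <= 1) with linear interpolation (an assumption)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(math.floor(pos))
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def trait_relevance_scores(np_scores, z_scores, ceiling_q=0.99, top_fraction=0.01):
    """Scale network propagation scores to [0, 1], then multiply by the mean top Z-score."""
    ceiling = percentile(np_scores, ceiling_q)
    # Cap extreme scores at the ceiling so that outliers do not dominate the scaling.
    capped = [min(s, ceiling) for s in np_scores]
    lo, hi = min(capped), max(capped)
    scaled = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in capped]
    # Scaling factor: mean bias-corrected Z-score of the top 1% of cells.
    m = max(1, round(top_fraction * len(z_scores)))
    factor = sum(sorted(z_scores, reverse=True)[:m]) / m
    return [s * factor for s in scaled]
```

Under these assumptions the largest TRS equals the mean Z-score of the top cells, which ties the scale of the TRS to the g-chromVAR significance level.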

Cell state identification using the permutation test

To further assess whether a cell is enriched or depleted for the trait of interest, we propose a method to determine the statistical significance for individual cells by calculating an empirical distribution of scores per cell, instead of using a fixed cutoff arbitrarily (Extended Data Fig. 5c). Here, we focus on network propagation score rather than TRS because network propagation scores directly result from network propagation processes without further normalization and scale, and the sum of network propagation scores across all cells is constantly equal to 1 and independent of seed cell selection, such that the network propagation scores yielded from SCAVENGE analysis with different sets of seed cells are directly comparable. We use a permutation-based method to generate the empirical distribution. We randomly select a set of number-matched seed cells to repeat SCAVENGE analyses. To maintain consistency of topology attributes with real seed cells, we require the permuted seed cells to have the same degree distribution for each permutation. That means, if a seed cell has m neighbors in the cell-to-cell graph, the matched permuted one can be selected only from cells with m neighbors. The enriched cell is expected to have a larger network propagation score than that from permuted seed cells. The permutations can be repeated independently multiple times. For each cell, the significance can be determined by comparisons of real network propagation scores and those from permutations. The empirical P value is defined as \(\forall c \in \{ cell_1, \ldots ,cell_N\}\),

$$p_e^c = \frac{{1 + \mathop {\sum }\nolimits_{i = 1}^B I\left( {NP^c \le NP_i^c} \right)}}{{1 + B}},$$

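where B is the number of permutations (B = 1,000), \(NP^c\) is the real network propagation score of cell c, \(NP_i^c\) is its score in the i-th permutation, and I(·) is the indicator function. The empirical P value is thus the proportion of permutations in which the permuted network propagation score is at least as large as the real score, with 1 added to both numerator and denominator. Trait-enriched cells are defined as cells with P less than 0.05.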
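The permutation test can be sketched in two parts: drawing degree-matched random seeds and computing the empirical P value. The function names are hypothetical. Drawing permuted seeds without replacement within each degree class is an assumption, as the text does not specify the sampling scheme.

```python
import random
from collections import Counter, defaultdict


def degree_matched_seeds(adjacency, seeds, rng):
    """Draw random seeds with the same degree distribution as the real seeds."""
    by_degree = defaultdict(list)
    for cell, neighbors in enumerate(adjacency):
        by_degree[len(neighbors)].append(cell)
    # How many seeds are needed for each degree value.
    needed = Counter(len(adjacency[s]) for s in seeds)
    permuted = []
    for degree, count in needed.items():
        permuted.extend(rng.sample(by_degree[degree], count))
    return permuted


def empirical_p_values(real_scores, permuted_scores):
    """Per-cell empirical P values.

    real_scores: network propagation score per cell from the real seeds.
    permuted_scores: list of B score lists, one per permutation.
    """
    b = len(permuted_scores)
    p_values = []
    for c, real in enumerate(real_scores):
        # Count permutations whose score is at least as large as the real one.
        exceed = sum(1 for perm in permuted_scores if real <= perm[c])
        p_values.append((1 + exceed) / (1 + b))
    return p_values
```

Each permuted seed set would be propagated through the same graph to produce one entry of `permuted_scores`. The smallest attainable P value is 1/(1 + B).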

Assessing performance with simulations

To evaluate the calibration and power of our method, we conducted simulations from downsampling a variety of FACS-sorted bulk hematopoietic populations13. The simulation framework uses an approach that has been described previously15. We started from a peak-by-cell count matrix generated from bulk ATAC-seq data. The read count of synthetic single cells for peak i in cell type t follows a binomial distribution \(binom(2,p_i^t)\), where \(p_i^t = (1 - q)r_i^t/2 + qn/2k\), \(r_i^t\) is the ratio of reads for peak i in cell type t from the bulk ATAC-seq data; k is the total number of peaks in the bulk data; n is the number of simulated fragments; and q specifies noise level (\(q \in [0,1]\)), where q = 0 is no noise, and q = 1 indicates the highest level of noise, which means a random distribution of n fragments into k peaks.

We used g-chromVAR to assess enrichment of the highly heritable trait of monocyte count across 16 hematopoietic cell types (Extended Data Fig. 2a). Next, we created ground truth datasets from the bulk samples, including monocytes and natural killer (NK) cells, to represent enriched and depleted cell populations, respectively. The sparsity of simulated datasets is similar to that observed in the real datasets (Extended Data Figs. 1a and 2b). To investigate how unbalanced cell compositions of simulations may affect SCAVENGE performance, we created a variety of synthetic datasets with different proportions of relevant cell populations. Precisely, 1,000 cells were synthesized for each simulation with the relevant cells (monocytes) composing between 10% and 90% of the population, with 10% as the gradient. The genetic variants associated with monocyte count are examined, and the metrics of area under the receiver operating characteristic (auROC), true-positive rate (TPR) and false-positive rate (FPR) are calculated across the simulations (Extended Data Fig. 2c–e). We found that SCAVENGE is robust to different cell compositions for both a uniform number of and unbalanced numbers of cells for different cell types used in the simulations. In addition, although ground truth is not available, robust and relevant enrichments via different real scATAC-seq datasets that include a distinct number of cells across different cell types were observed (Extended Data Fig. 3c), supporting the good performance of SCAVENGE with unbalanced cell type compositions in real datasets.

Given that cell numbers for specific cell types in real settings are undetermined and highly variable across datasets from different biological systems and conditions, we, thus, used a uniform cell number for simulation to facilitate further validation and interpretation (Fig. 2a,b and Extended Data Fig. 2h–k). We simulated 500 cells per labeled cell type with the parameters of n = 10,000 and q = 0.3 for the following benchmark analysis. Intuitively, the top-ranked 500 cells will be monocytes, whereas the bottom-ranked 500 cells will be NK cells if these cells are perfectly classified. Additional simulations are also generated to investigate the robustness of SCAVENGE. We set parameter n to various values, including 5,000, 7,500, 10,000, 25,000 and 50,000, to test the effects of sequencing depth. We set q to various values, including 0.25, 0.3, 0.35 and 0.4, to test the robustness to noise. To qualitatively assess the performance of SCAVENGE in a more complex situation, we also generated another dataset consisting of nine cell types that showed trait relevance at different levels, where 200 cells per labeled cell type were synthesized. SCAVENGE was applied to these simulated datasets using the default parameters, except for evaluation of the number of seed cells and the number of neighbors used for graph construction.

Application of SCAVENGE to the scATAC-seq datasets

Four independent datasets were used for SCAVENGE analysis as use-case examples in this study. All cell type annotations and metadata were obtained from the original studies unless we specifically state otherwise below.

The 10x Genomics PBMC dataset

We downloaded fragment files of this dataset from the 10x Genomics website (https://support.10xgenomics.com/single-cell-atac/datasets/1.0.1/atac_v1_pbmc_5k). This PBMC dataset includes 5,335 cells from one donor, and no cell annotations were provided. The dataset was processed by the standard ArchR pipeline with default parameters60, including Arrow file creation, quality control, doublet inference, dimensionality reduction and clustering. We initially obtained eight cell clusters and kept the six that each contained at least 50 cells. We retained 4,562 cells for SCAVENGE analysis for the trait of monocyte count. Gene scores of several cell-type-specific marker genes were calculated based on chromatin accessibility in the vicinity of the gene and used to annotate PBMC populations.

Hematopoiesis scATAC-seq dataset

This hematopoietic cell dataset consists of 35,038 cells from two bone marrow mononuclear cell (BMMC) donors, three CD34+-enriched bone marrow cell (CD34+) donors and five PBMC donors. The data were processed as described in the original publication23. We downloaded the processed data in summarized experiments format from https://github.com/GreenleafLab/MPAL-Single-Cell-2019. Three cell types were removed owing to unknown cell labels. A total of 33,819 cells from 23 cell populations were selected for further analysis. The LSI-by-cell matrix with the first 30 leading LSIs was extracted for M-kNN graph construction. The peak-by-cell matrix was used as input for SCAVENGE analysis for 22 blood cell traits. The per-cell TRS was visualized with UMAP66 coordinates. For each tested blood trait, the TRSs from the same cell population were collapsed into their median value to represent the TRS at the cell type level.

Hematopoiesis scATAC-seq dataset 2

This hematopoietic dataset consists of 63,882 cells from one BMMC donor, two CD34+ donors and 16 PBMC donors. The data were processed as described in the original publication28. We downloaded the processed peak-by-cell count matrix as well as cell annotations of this dataset from https://github.com/GreenleafLab/10x-scATAC-2019. A total of 63,882 cells of 31 cell types were used for SCAVENGE analysis for a variety of blood cell traits, which is similar to the above hematopoiesis scATAC-seq dataset. We also applied SCAVENGE on this dataset to explore the enrichment of ALL associations.

We also constructed a single-cell trajectory of B cell development to further examine how ALL risk is variably enriched along this trajectory. This trajectory consists of eight cell types from HSCs to progenitors to mature B cells. The pseudo-time for each cell in this trajectory was calculated as previously described28. To identify TFs correlated with genetic trait enrichments, we calculated the Spearman correlation between TF motif enrichment scores and SCAVENGE TRSs using all cells in the trajectory. The trajectory was divided into 100 equal bins along the pseudo-time. For each bin, we computed the gene activity as the proportion of cells that have non-zero values of gene scores. Gene activities for selected TFs were shown in the pseudo-time heat maps.
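The binning step can be sketched as follows. The function name `gene_activity_by_bin` is hypothetical. Interpreting "equal bins" as bins of equal width in pseudo-time, rather than bins with equal cell counts, is an assumption.

```python
def gene_activity_by_bin(pseudotime, gene_scores, n_bins=100):
    """Proportion of cells with a non-zero gene score in each pseudo-time bin.

    Returns a list of length n_bins; bins without cells are reported as None.
    """
    lo, hi = min(pseudotime), max(pseudotime)
    width = (hi - lo) / n_bins
    nonzero = [0] * n_bins
    total = [0] * n_bins
    for t, score in zip(pseudotime, gene_scores):
        # The maximum pseudo-time falls into the last bin.
        b = min(int((t - lo) / width), n_bins - 1) if width > 0 else 0
        total[b] += 1
        if score != 0:
            nonzero[b] += 1
    return [nonzero[b] / total[b] if total[b] else None for b in range(n_bins)]
```

Applied per TF gene, the resulting vector of proportions gives one row of a pseudo-time heat map.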

COVID-19 PBMC scATAC-seq dataset

This dataset comprises 97,315 PBMCs, obtained from three healthy donors and eight patients with COVID-19, of whom five had moderate disease and three had severe disease. The fragment files processed using the Cell Ranger pipeline were obtained from the authors of the original paper31. We performed cell clustering and cell type annotation using the ArchR package60. We created Arrow files from fragment files and performed quality control with metrics including the number of unique fragments and enrichment of the transcription start site. Iterative LSI was performed with the ‘addIterativeLSI’ function. As batch effects in single-cell genomic data analysis remain a central challenge that can obscure the biological signal of interest, potential batch effects for both cell type annotation and cell-to-cell network construction need to be removed. We corrected potential sample-specific and other batch effects using the Harmony algorithm with the ‘addHarmony’ function67. The Harmony-corrected LSI matrix was also used to build the cell-to-cell graph for SCAVENGE analysis. We applied UMAP dimensionality reduction and Leiden clustering68 to the batch-corrected epigenomic datasets. Initially, 25 cell clusters were identified, and we merged similar cell clusters and annotated cell populations using gene scores of canonical markers from the original publication. For the resulting 15 cell types, we performed peak calling and generated the peak-by-cell matrix for SCAVENGE analysis of the COVID-19 severity-associated genetic variants. Given that the cell clustering analysis is performed using the same Harmony-corrected LSI, and individual cells are grouped accordingly for peak calling, the resulting peak-by-cell matrix that is used for SCAVENGE analysis also benefits from Harmony analysis to account and correct for potential batch effects.

To explore the heterogeneity of trait-associated enrichments, we performed cell state discovery analyses as described above, and the cells were segregated into a severe COVID-19 risk variant-enriched population and a severe COVID-19 risk variant-depleted population. The number and proportion of these two cell states were investigated across individual cell types. We found that, in most cell types, the cell numbers of these two cell states differ greatly. We, thus, selected the same number of cells that are most representative of each cell state for further analysis. In the case of CD14+ monocytes, 1,000 cells with the highest TRS in severe COVID-19 risk variant-enriched cells and 1,000 cells with the lowest TRS in severe COVID-19 risk variant-depleted cells were selected to explore the differences of TF motif enrichment in the peak region. The accessibility profiles of these 2,000 cells were used to compute gene scores for genes of interest. The corresponding genome browser accessibility tracks of single-cell-based occupancy and pseudo-bulk samples were plotted using the ‘plotBrowserTrack’ function.

GWAS summary statistics and fine-mapping analysis

Blood cell traits

Summary statistics of 22 blood cell traits from the Blood Cell Consortium 2 (BCX2) analysis were processed as previously described22. Variants with fine-mapped PP > 0.001 for a locus in one or more blood traits were retained and used for analyses.

COVID-19 severity

We obtained summary-level GWAS data of B1 (hospitalized COVID-19+ versus non-hospitalized COVID-19+) from the COVID-19 Host Genetics Initiative (release 5, https://www.covid19hg.org) with ancestry restricted to European individuals. This COVID-19 severity trait is from a meta-analysis of 13,641 moderate or severe COVID-19 hospitalized cases and 49,562 reported cases of SARS-CoV-2 infection. Given that only summary-level data were available in this instance (raw genotype-level data were not accessible), conditionally independent signals were first identified using GCTA-COJO69. In COJO, window size was set to 10 Mb, and the P value threshold was set to a suggestive level of \(1.0 \times 10^{-6}\) because of limited signal reaching genome-wide significance. Subsequently, approximate Bayes factor (ABF) analyses were performed as described30 using a window size of 1 Mbp on either side of independent variants. The prior variance in allelic effects was estimated as 0.04, considered to be broadly appropriate for this method, and calculated using formula (8)30. For loci containing multiple independent signals, association statistics surrounding the index variant in question were based on GCTA approximate conditional analysis adjusting for all other independent variants in that region of 1 Mbp on either side. Finally, the PP of being causal was calculated by dividing the ABF of each variant by the sum of ABF values over all variants in the window. LocusZoom-style plots were created in R, using a 1000G European-subsetted reference panel for linkage disequilibrium (LD) information.
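The final normalization step, from per-variant ABFs to PPs within one window, can be sketched as follows. The function name `posterior_from_log_abf` is hypothetical. Taking log-scale ABFs as input is an assumption made here for numerical stability, since ABFs of strongly associated variants can overflow floating-point range.

```python
import math


def posterior_from_log_abf(log_abfs):
    """Posterior probability of causality for each variant in one window.

    log_abfs: natural-log approximate Bayes factors, one per variant.
    Equivalent to ABF_i / sum(ABF), computed in log space.
    """
    largest = max(log_abfs)
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = [math.exp(value - largest) for value in log_abfs]
    total = sum(weights)
    return [w / total for w in weights]
```

For two variants with ABFs of 1 and 3, the PPs are 0.25 and 0.75. By construction, the PPs in a window sum to 1, which corresponds to assuming a single causal variant per conditionally independent signal.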

ALL predisposition

The GWAS data of childhood ALL were obtained from our previous study44. For causal variant identification, we performed fine-mapping at 13 well-replicated and three novel ALL risk loci identified in our recent trans-ancestry GWAS. In this instance, where raw genotype data were available, FINEMAP was used70. An LD matrix was created for 1 Mbp on either side of lead significant variants using an unrelated set of genotypes (third-degree relatives or closer), including all ancestry groups. FINEMAP was run in the stochastic search method, with all defaults in place, apart from --n-causal-snps=10, and the PPs of variants being causal were obtained. Due to substantial overlap at the BMI1–PIP4K2A locus, variants contributing more causal information (higher PP) were preferentially included.

The sparsity of scATAC-seq

To assess the sparsity of scATAC-seq data, we used five published datasets, including 10x PBMCs (n = 4,562), leukemic cells (Leukemia, n = 391)71, a mixture of GM12878 and HEK293T cells (GM12878vsHEK, n = 526)59,71, a mixture of GM12878 and HL-60 cells (GM12878vsHL, n = 597)59 and a mixture of breast tumor 4T1 cells (Breast_Tumor, n = 384)72. These datasets cover two commonly used scATAC-seq platforms of microfluidics (10x Genomics for PBMCs, Leukemia and Breast_Tumor) and cellular indexing (GM12878vsHEK and GM12878vsHL). The 10x PBMC dataset was obtained and processed as described above. The other four datasets were processed as previously reported73. We downloaded the h5ad files from https://github.com/jsxlei/SCALE and extracted the peak-by-cell matrix from each. Two measures of sparsity were examined: (1) the sparsity of peaks, which is the proportion of cells with no signal for a given peak; and (2) the sparsity of cells, which is the proportion of peaks with no signal for a given single cell. Peak calling is performed with a pseudo-bulk sample, which is generated by the aggregation of all single-cell profiles in each dataset, which implies that every peak will present abundant signal in the pseudo-bulk data. As the pseudo-bulk accessibility data are highly correlated with and resemble a bulk ATAC-seq experiment, we reasoned that these two measurements could well represent the sparsity of individual cells compared to corresponding bulk or pseudo-bulk ATAC-seq data.
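The two sparsity measures can be computed directly from a peak-by-cell matrix. The function name `sparsity_measures` and the dense list-of-rows input are assumptions for illustration. Real matrices of this kind are stored in sparse formats.

```python
def sparsity_measures(matrix):
    """Sparsity of peaks and of cells for a peak-by-cell count matrix.

    matrix: list of rows, one per peak, each with one count per cell.
    Returns (peak_sparsity, cell_sparsity) as lists of proportions.
    """
    n_peaks = len(matrix)
    n_cells = len(matrix[0])
    # (1) For each peak: the proportion of cells with no signal.
    peak_sparsity = [sum(1 for x in row if x == 0) / n_cells for row in matrix]
    # (2) For each cell: the proportion of peaks with no signal.
    cell_sparsity = [
        sum(1 for i in range(n_peaks) if matrix[i][j] == 0) / n_peaks
        for j in range(n_cells)
    ]
    return peak_sparsity, cell_sparsity
```

A value close to 1 for a cell means that almost none of the peaks called on the pseudo-bulk profile carry signal in that individual cell.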

TF motif analysis

We used chromVAR38 to measure global TF activity. We used the peak-by-cell matrix and TF motifs within the non-redundant JASPAR 2018 CORE vertebrate dataset (n = 322) to compute bias-corrected deviation Z-scores for each cell. We compared motif enrichment Z-scores of cells with variable states by using Benjamini–Hochberg-corrected P values from one-sided Student’s t-tests.
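The multiple-testing correction applied to the per-motif P values can be sketched as a standard Benjamini–Hochberg step-up adjustment. The function name `benjamini_hochberg` is hypothetical, and this sketch covers only the adjustment, not the t-tests or the chromVAR deviation scores themselves.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, raw P values of 0.01, 0.04, 0.03 and 0.5 are adjusted to 0.04, about 0.053, about 0.053 and 0.5.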

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.