SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks

Joint profiling of chromatin accessibility and gene expression in individual cells provides an opportunity to decipher enhancer-driven gene regulatory networks (GRNs). Here we present a method for the inference of enhancer-driven GRNs, called SCENIC+. SCENIC+ predicts genomic enhancers along with candidate upstream transcription factors (TFs) and links these enhancers to candidate target genes. To improve both recall and precision of TF identification, we curated and clustered a motif collection with more than 30,000 motifs. We benchmarked SCENIC+ on diverse datasets from different species, including human peripheral blood mononuclear cells, ENCODE cell lines, melanoma cell states and Drosophila retinal development. Next, we exploit SCENIC+ predictions to study conserved TFs, enhancers and GRNs between human and mouse cell types in the cerebral cortex. Finally, we use SCENIC+ to study the dynamics of gene regulation along differentiation trajectories and the effect of TF perturbations on cell state. SCENIC+ is available at scenicplus.readthedocs.io.

each direction and less significant peaks are iteratively filtered out.During this procedure peaks will be merged and depending on the number of peaks included into them, different processes will happen: 1) 1 peak: The original peak will be kept, 2) 2 peaks: The original peak region with the highest score will be kept and 3) 3 or more peaks: The original region with the most significant score will be taken, and all the original peak regions in this merged peak region that overlap with the significant peak region will be removed.The process is repeated with the next most significant peak (if it was not removed already) until all peaks are processed.This procedure will happen twice, first for each pseudobulk peak, and after peak score normalization to process all peaks together.We recommend using this set of regions downstream, as we have observed that using pseudobulk peaks improves signal compared to using bulk peaks across the whole population (specially for rare cell types, whose signal may be confused by noise while using the merged ATAC-seq profile of the whole population).In case of independent scATAC-seq data, the cell annotation can also be obtained from alternative methods, such as a preliminary clustering analysis using a predefined set of genome-wide regions/peaks (e.g.SCREEN 5 ) as input to identify cell populations.
-QC analysis and cell selection**: PycisTopic computes QC metrics at the sample-level and the barcode-level.Sample-level statistics can be used to assess the overall quality of the sample, while barcode level statistics can be used to differentiate good quality cells versus the rest.

Sample-level statistics include:
• Barcode rank plot: The barcode rank plot shows the distribution of nonduplicate reads and which barcodes were inferred to be associated with cells.
A steep drop-off ('knee') is indicative of good separation between the cellassociated barcodes and the barcodes associated with empty partitions.
• Insertion size: ATAC-seq requires a proper pair of Tn5 transposase cutting events at the ends of DNA.In the nucleosome-free open chromatin regions, many molecules of Tn5 will fragment the DNA; around nucleosome-occupied regions, Tn5 can only access the linker regions.Therefore, in a good ATAC-seq library, one should expect to see a sharp peak at the <100 bp region (open chromatin), and a peak at ~200bp region (mono-nucleosome), and other larger peaks (multi-nucleosomes).A clear nucleosome pattern indicates a good quality of the experiment.
• Sample TSS enrichment: The TSS enrichment calculation is a signal to noise calculation.The reads around a reference set of TSSs are collected to form an aggregate distribution of reads centered on the TSSs and extending to 1,000 bp in either direction (for a total of 2,000bp).This distribution is then normalized by calculating a fold change compared to 100bp at either side of the flanks of the distribution.
• FRiP distribution: Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads.A low FRIP indicates that many reads form part of the background, and so that the data is noisy.
• Duplication rate: A fragment is considered "usable" if it uniquely maps to the genome and remains after removing PCR duplicates (defined as two fragments that map to the same genomic position and have the same unique molecular identifier).The duplication rate serves to estimate the number of usable reads per barcode.High duplication rates may indicate over-sequencing or lack of fragments after transposition and encapsulation.

Barcode-level statistics include:
• Total number of (unique) fragments per cell-barcode • TSS enrichment per cell-barcode: The normalized coverage at the TSS position for each barcode (average +-10bp from the TSS).Noisy cells will have a low TSS enrichment.
• FRiP per cell-barcode: The fraction of reads in peaks for each barcode.Noisy cells have low FRIP values.However, this filter should be used with nuance, as it depends on the quality of the original peaks.For example, if there is a rare population in the sample, its specific peaks may be missed by peak calling algorithms, causing a decrease in their FRIP values.
-Count matrix generation*: PycisTopic can generate a fragment count matrix from the fragments files, a set of regions (preferably, consensus regions as previously explained) and a list of high quality cells.Alternatively, a precomputed count matrix can also be used as input.In this step a cisTopic object will be created, including the fragment counts, path/s to the fragments files (if used to generate the count matrix) and cell/region metadata.Scrublet 6 (v0.2.3),By default, and when dealing with 10x data sets, the expected doublet rate is set to 10%.
-Topic modelling algorithms and model selection*: PycisTopic implements two algorithms for topic modelling, serial LDA with a Collapsed Gibbs Sampler (as implemented in the lda module) and Mallet, which allows to parallelize the LDA model estimation.The same default parameters as in cisTopic 1 are used.
Following metrics for model selection are included in pycisTopic: • Mimno_2011: Uses the average model coherence as calculated by Mimno et al (2011) 7 .To reduce the impact of the number of topics, the average coherence based on the top values is used.Better models have a higher coherence.
• Log-likelihood: Uses the log-likelihood in the last iteration as calculated by Griffiths and Steyvers (2004)  8 , as used in cisTopic.Better models have a higher log-likelihood.
• Arun_2010: Uses a density-based metric as in Arun et al (2010) 9 using the topic-region distribution, the cell-topic distribution and the cell coverage.Better models have a lower score for this metric.The best model (i.e. the one with the optimal number of topics) is the model with the smallest number of topics where coherence (Minmo_2011) and Log-likelihood are maximized and Arun_2010 and Cao_Juan_2009 are minimized.
-Dimensionality reduction and batch effect correction**: Cells (or regions) can be clustered using the leiden algorithm and embedded using dimensionality reduction with UMAP and TSNE using the cell-topic (or topic region distributions).
In addition, harmonypy (v0.0.5) can be used on scaled cell-topic distributions to correct for batch effect between samples (see mouse cortex analysis).When working with single cell multiome data, it is possible to co-cluster and reduce dimensionality using both the scRNA-seq and scATAC-seq data by using UMAP to build fuzzy simplicial sets (similar to KNN graphs).
-Topic binarization and QC*: To perform motif analysis (and other downstream steps) topics have to be binarized into region sets rather than continuous distributions.Several binarization methods are included in pycisTarget (applicable for topic-region and cell-topic distributions): 'otsu' (Otsu, 1979)  11 , 'yen' (Yen et al.,   1995) 12 , 'li' (Li & Lee, 1993)  13 , 'aucell' (Van de Sande et al., 2020) 14 or 'ntop' (Taking the top n regions per topic).Otsu and Yen's methods work well for topic-region distributions; however, for some downstream analyses, it may be convenient to use 'ntop' to have balanced region sets (e.g.training of classification models).By default, pycisTopic uses Otsu thresholding for binarization, as it results in the largest number of regions per topic while limiting the amount of "noise" for motif enrichment analysis.For cell-topic distributions, we recommend using the AUCell method.In addition, pycisTopic includes new metrics to assess topic quality: • Number of assignments and regions/cells per binarized topic.
• Topic coherence (Mimno et al., 2011) 7 : Measures to which extent high scoring regions in the topic are co-accessible in the original data.If it is low, it indicates that the topic is rather random.Better topics have a higher score.
• The marginal topic distribution: Indicates how much each topic contributes to the model.Better topics have a higher score.
• The gini index: Value between 0 and 1, that indicates the specificity of topics (0: General, 1: Specific) -Drop-out imputation*: Thanks to the probabilistic nature of topic modelling, dropouts can be imputed by multiplying the cell-topic and topic-region distributions, resulting in a matrix with the probabilities of each region in each cell.
-Calculation of Differentially Accessible Regions (DARs)*: Using the imputed fragment matrix Differentially Accessible Regions, or DARs can be calculated.
Briefly, a Wilcoxon rank sum test is performed for user specified contrasts.By default, a DAR is defined as a region with LogFC > 0.5 and benjamini hochberg adjusted p values < 0.05.
-Gene activity and Differentially Accessible Genes (DAGs): The gene activity recapitulates the overall accessibility values in a space around the gene.
Differentially Accessible Genes (DAGs) can be derived from this matrix.The user can select among different options: • Search space: The user can choose whether the search space should include other genes or not (use_gene_boundaries), and the minimum and maximum distance it should have (upstream and downstream).Promoters can be excluded from the calculations, as they are usually ubiquitously accessible.
• Distance weight: The parameters weights the impact of distance when inferring region to gene weights as an exponential function.The user can control whether this weight should be used (distance_weight) and the effect of the distance (decay_rate).In addition, the user can choose from which distance to the gene body this weight is applied (extend_gene_body_upstream and extend_gene_body_downsstream) • Gene size weight: Large genes may have more peaks by chance.The user can optionally apply a weight based on the size of each gene (gene_size_weight), which by default is dividing the size of each gene by the median gene size in the genome.Alternatively, the user can also use average_scores which will calculate the gene activity as the mean weighted region accessibility of all regions linked to the gene.
• Gini weight: This weight will give more importance to regions that are highly specific (gini_weight).
Except for ingest, these methods return a common coembedding and labels are inferred using the distances between query and reference cells as weights.
-pyGREAT: pycisTopic makes GREAT (Genomic Regions Enrichment of Annotations Tool) 19 analysis automatic by constructing a HTTP POST request according to an input region set and automatically retrieving results from the GREAT web server.
-Signature enrichment: Given epigenomic signatures are intersected with the regulatory regions in the dataset and summarized into region sets.Using the imputed fragment matrix, all regions in each cell are ranked and the cell-specific rankings are used as input for AUCell.By default, we use 5% of the total number of regions in the dataset as a threshold to calculate the AUC.
-Export to loom files**: PycisTopic allows to export cisTopic object to loom files compatible with Scope for visualization 20 and SCopeLoomR, for importing pycisTopic analyses into R.

PycisTarget
PycisTopic unsupervisedly identifies groups of co-accessible regions (cis-regulatory topics) as well as cell type specific enhancers (DARs).These regulatory programs are used in the second step of SCENIC+, in which TFs and their potential target regions (i.e.cistromes) are identified using motif enrichment analysis.For this purpose, we have developed pycisTarget, a motif enrichment suite that combines different motif enrichment approaches such as cisTarget [21][22][23] and Homer 24 ; and a novel approach to compute Differentially Enriched Motifs between sets of regions called DEM.
-Homer: PycisTarget includes a wrapper to run Homer's findMotifsGenome.pl,allowing the identification of known and de novo motifs (by default using default Homer parameters).For identifying cistromes for each motif, found motifs are used as input for homer2 find.Known motifs are annotated according to the motif annotation in the SCENIC+ motif collection.To annotate de novo motifs, Tomtom 25 is run with the SCENIC+ motif collection to identify the closest match, allowing to transfer its annotation to the de novo motif when specified.To form TF cistromes, motif-based cistromes are combined based on the TF annotations.
-Generation of cisTarget databases: cisTarget and DEM require ranking and score-based databases, respectively, with regions as rows, motifs (or motif clusters) as columns, and scores or ranking values of these scores, as values.For this, we have developed an adapted version of Cluster-Buster We provide precomputed motif databases using predefined sets of regulatory regions for hg38, mm10 (using SCREEN regions 5 ) and dm6 (using cisTarget regions, defined by partitioning the entire non-coding Drosophila genome based on cross-species conservation) at: https://resources.aertslab.org/cistarget/databases/. cisTarget databases can also be generated using genomic tracks, for instance from TF ChIP-seq.To generate track databases, a bed file indicating each genomic region and the average signal (e.g. as output from bigWigAverageOverBed) has to be provided.Regions will then be sorted per track in decreasing order based on these scores.In the case of TF ChIP-seq tracks, tracks are linked to the TF that was targeted in the ChIP-seq experiment.
-cisTarget: pycisTarget implements the ranking based motif enrichment method cisTarget [21][22][23] .Briefly, genomic regions (i.e.consensus peaks, or predefined regions from SCREEN) are first scored using a motif collection with Cluster-Buster, as previously described.These regions will be ranked in decreasing order based on their score for each motif.The input regions are intersected with regions in the database (with at least 40% overlap).cisTarget uses a recovery curve approach (for each motif), in which a step is taken in the y-axis when as region in the motif ranking (x-axis) is found in the region set.The Area Under the Curve for each motif is normalized based on the average AUC for all motifs and their standard deviation, resulting in a Normalized Enrichment Score (NES) that is used to quantify the enrichment of a motif in a set of regions:

𝑚𝑒𝑎𝑛(𝐴𝑈𝐶)
The average AUC values across all motifs

𝑠𝑑(𝐴𝑈𝐶)
The standard deviation of AUC values across all motifs By default, motif that obtain a NES above 3.0 are kept.To obtain the target regions for each motif (motif-based cistrome) the regions at the top of the ranking (leading edge) are retained.The top of the ranking is defined by an automated thresholding method that retains regions with a ranking below the rank at max, which is defined by the following formula: With: !"#$% the recovery curve of the motif of interest.
To obtain target region for each TF, the motif-based cistromes of motifs annotated to the TF are merged.
-DEM: DEM performs a Wilcoxon test between scores in foreground and background region sets to assess motif enrichment.Briefly, genomic regions (i.e.consensus peaks, or predefined regions from SCREEN 5 ) are first scored using a motif collection with Cluster-Buster, as previously described.Regions in the input region sets are intersected with regions in the database (with at least 40% overlap), and a Wilcoxon rank sum test is performed between CRM score distributions in the two groups.By default, motifs adjusted p-value < 0.05 (Bonferroni) and LogFC > 0.5 are kept.A cistrome for each motif is generated by simultaneously optimizing precision and recall to separate foreground from background regions or using a predefined CRM threshold.To obtain the target regions for each TF, the motif-based cistromes of motifs (annotated to that TF are combined. Within the SCENIC+ workflow, motif enrichment is performed by default in binarized topics and DARs calculated by pycisTopic, using cisTarget and DEM (including and excluding promoters from the region sets).By default, DEM is run using topics or DARs as foreground and 500 regions in other topics/DARs as background (with the same proportion of promoters as in the foreground).Additional region sets (e.g.DARs derived from specific contrasts instead of using all populations as background) can be easily added.Cistromes derived from all the motif enrichment analysis are merged by TF to generate a final set of TF-region cistromes.

SCENIC+
eGRNs are predicted in the final step of SCENIC+.This step requires as input: the gene expression matrix, the imputed accessibility matrix (from pycisTopic) and the TF cistromes (from pycisTarget).Input data can be single-cell multiome or paired scRNAseq and scATAC-seq in matching populations (see below).This final step consists of the following steps: -Generating pseudo-multiome data (in case of non-multiome data): To generate pseudo multiome data, cells must be annotated by common labels in both data modalities (single-cell chromatin accessibility and gene expression).
Pseudo multiome cells are generated by sampling a predefined number of cells from each data modality, within the same cell type annotation label, and averaging the raw gene expression and imputed chromatin accessibility data across these cells to create a multiome meta cell containing data of both modalities.By default, the number of meta cells generated for each cell type annotation label is set as such that each cell is included in a meta cell on average twice.
-Calculating TF-to-gene and region-to-gene relationships: TF-to-gene relationships are calculated as described in 14,27 .Briefly, the Arboreto python package (v0.1.6)is used to calculate TF-to-gene importance scores for each TF and each gene, given a list of known TFs and the raw gene expression matrix.By default, Gradient Boosting Machine regression is used.Pearson correlation is used to separate positively correlating from negatively correlating relationships (resp.correlation coefficient above 0.03 or below -0.03 by default).The TF itself is not included in the initial TF-gene relationship calculation (otherwise it would skew the importance scores of the other genes).
Therefore, in order to be able to infer autoregulation (TFs regulating their own expression) the importance score of the TF itself is set to the maximum importance score across all genes added with an arbitrary small value of 1E-5, in order to put the TF at the top of its own ranking.Region-to-gene relationships are calculated for each gene by considering all regions within a search space surrounding that gene.By default, a search space of a minimum of 1kb and a maximum of 150kb upstream/downstream of the start/end of the gene or the promoter of the nearest upstream/downstream gene is used.By default, the promoter of the gene is considered as the TSS of the gene +/-10 bps.For each consensus peak in the search space of each gene region-to-gene importance scores are calculated using the Arboreto python package (v0.1.6)   using the imputed accessibility of the regions as predictors for the raw gene expression counts and Gradient Boosting Machine regression (by default).
Spearman rank correlation (SciPy v1.8.1) is used to separate positively correlating from negatively correlating relationships (resp.correlation coefficient above 0.03 or below -0.03 by default).Before eGRN building, region-to-gene relationships are binarized using multiple methods.By default, the 85 th , the 90 th and the 95 th quantile of the region-to-gene importance scores, the top 5, 10, and 15 regions per gene based on the region-to-gene importance scores and a custom implementation of the BASC 28 method on the region-togene importance scores is used.
-eRegulon creation: We define an eRegulon as a TF together with its target regions and genes.To generate this, information from gene expression, region accessibility and motif enrichment are combined.For each TF, TF-region-gene triplets are generated by taking all regions that are enriched for a motif annotated to the TF and all genes linked to these regions, based on the binarized region-to-gene links (see "Calculating TF-gene and region-gene relationships").However, we only want to include genes, and the regions linked to them, in the final eRegulon if they are significantly co-expressed with the TF.
To determine this, Gene Set Enrichment Analysis (GSEA) is used.Here, all genes are ranked based on the TF-gene importance scores and an enrichment analysis of the set of genes in the triplet compared to the overall ranking of the genes is computed, using the gsea_compute function from the GSEApy python package (v0.10.8).Finally, only the genes at the top of the ranking, known as the leading edge are retained in the final eRegulon.This analysis is run separately for TF-gene and region-to-gene relationships with positive and negative correlation coefficients, for a total of four GSEA runs.By default, eRegulons with less than 10 predicted target genes or obtained from region-togene relationships with a negative correlation coefficient are discarded.
-eRegulon enrichment: Target regions and genes of each eRegulon are used separately together with regions and genes ranked based on imputed accessibility and raw gene expression counts in each cell as input for AUCell -eRegulon dimensionality reduction: The eRegulon enrichment scores for regions and genes are normalized for each cell and used as input into the UMAP, tSNE or PCA from the python package umap (v0.5.2), fitsne (v1.2.1) or Scikit-learn (v0.24.2) respectively.
-eRegulon specificity scores: eRegulon specificity scores are calculated using the RSS algorithm as described in 14,29 using target region or target gene eRegulon enrichment scores as input.The Regulon Specificity Score (RSS) is used to identify marker regulons that differentiate between clusters or cell types.
We use the RSS for plotting regulon enrichment (as it normalizes the AUCell values) and to prioritize regulons per cell type (to select the most specific regulons for plotting, or to prioritize the differentiation velocities).
-Triplet ranking: all TF-region-gene triplets, as identified by SCENIC+, are ranked by generating the aggregated ranking of the TF-to-gene score, regionto-gene score and TF-to-region score.The TF-to-gene and region-to-gene scores are defined as the Gradient Boosting Machine regression importance scores for predicting gene expression from resp.TF expression and region accessibility across all cells.The TF-to-region score is defined as the best ranked position of the region across the ranks of all motifs annotated to the TF of interest.Ranks are aggregated as described in 30 .
-Export to loom files: To visualize SCENIC+ results in SCope 20 a chromatin accessibility-and gene expression-based loom file containing the eRegulons with target regions/genes and eRegulon enrichment scores for target regions/genes is generated, using the LoomXpy python package (v0.4.1; https://github.com/aertslab/LoomXpy).In addition, loom files can also be used to import data into R via the SCopeLoomR package.
-Visualization in the UCSC genome browser: To visualize SCENIC+ results in the UCSC genome browser a UCSC interaction file, containing region-togene links color coded by region-to-gene importance scores or correlation coefficients, and a bed file, containing genomic coordinates of eRegulon target regions is generated.The UCSC interact file and the bed file are converted to the bigInteract and bigBed format using the bedtobigbed program (v2.7) from UCSC.These can be uploaded to the UCSC genome browser along with pseudobulk BigWig files.

•:
Cao_Juan_2009Uses a divergence-based metric as in Cao Juan et al (2009) 10 using the topic-region distribution.Better models have a lower score for this metric.