Main

Gene expression can be modulated through the interplay of proximal and distal regulatory domains brought together in 3D space1. Chromatin regulatory domains, transcription factors (TFs), and downstream target genes form regulatory circuits2. Within circuits, the binding of TFs to chromatin regions and the 3D looping between these regions and gene promoters represent the mechanisms governing how TFs transform regulatory signals into changes in RNA transcription3,4. In disease, these circuits could be dysregulated in a cell-type-specific manner and may not be observed from bulk samples5. Therefore, identifying the impact of disease on regulatory circuits requires a framework for mapping regulatory domains with chromatin accessibility changes to altered gene expression in the context of cell-type resolution6. Single-cell RNA sequencing (scRNA-seq) and single-cell transposase-accessible chromatin sequencing (scATAC-seq) characterizing disease states have improved the identification of differential chromatin sites and/or differentially expressed genes within individual cell types5,7,8.

Yet, advances in single-cell assay technology have outpaced the development of methods to maximize the value of multiomics datasets for studying disease-associated regulation, especially for the regulatory interactions that are not directly measured by the omics data. Recent computational approaches9,10,11,12 to support the multiomics data analysis demonstrate the promise of this area but still lack the capacity to resolve regulation changes within individual cell types, which precludes elucidating regulatory circuits affected by the disease or showing different responses in varying disease states.

To address these, we developed MAGICAL, a method that models coordinated chromatin accessibility and gene expression variation to identify circuits (both the units and their interactions) that differ between conditions. MAGICAL analyzes scRNA-seq and scATAC-seq data using a hierarchical Bayesian framework. To accurately detect differences in regulatory circuit activity between conditions, MAGICAL introduces hidden variables for explicitly modeling the transcriptomic and epigenetic signal variations between conditions and optimization against the noise in both scRNA-seq and scATAC-seq datasets. Because regulatory circuits are cell-type specific13, MAGICAL reconstructs them at cell-type resolution. Systematic benchmarking against multiple public datasets supported the accuracy of MAGICAL-identified regulatory circuits.

S.aureus, a bacterium often resistant to common antibiotics, is a major cause of severe infection and mortality14,15. Using single-cell multiomics data generated from peripheral blood mononuclear cell (PBMC) samples of S.aureus-infected subjects and healthy controls, MAGICAL identified host response regulatory circuits that are modulated during S.aureus bloodstream infection, and circuits that discriminate the responses to methicillin-resistant S.aureus (MRSA) and methicillin-susceptible S.aureus (MSSA). Genes in the host circuits accurately predicted S.aureus infection in multiple validation datasets. Moreover, in contrast to conventional differential analysis that failed to identify specific genes for robust antibiotic-sensitivity prediction, MAGICAL-identified circuit genes can differentiate MRSA from MSSA. Therefore, MAGICAL can be used for multiomics-based gene signature development, providing a bioinformatic solution that can improve disease diagnosis.

Results

MAGICAL framework

MAGICAL identifies disease-associated regulatory circuits by comparing single-cell multiomics data (scRNA-seq and scATAC-seq) from disease and control samples (Fig. 1a). The framework incorporates TF motifs and chromatin topologically associated domain (TAD) boundaries as prior information to infer regulatory circuits comprising chromatin regulatory sites, modulatory TFs, and downstream target genes for each cell type. In brief, to build candidate disease-modulated circuits, differentially accessible sites (DAS) within each cell type are first associated with TFs by motif sequence matching and then linked to differentially expressed genes (DEG) in that cell type by genomic localization within the same TAD. Next, MAGICAL uses a Bayesian framework to iteratively model chromatin accessibility and gene expression variation across cells and samples in each cell type and to estimate the confidence of TF–peak and peak–gene linkages for each candidate circuit (Fig. 1b).

Fig. 1: Overview of MAGICAL for mapping disease-associated regulatory circuits from scRNA-seq and scATAC-seq data.
figure 1

a, Disease-modulated regulatory circutis. In the 3D genome, the altered gene expression in cells between disease and control conditions can be attributed to the chromatin accessibility changes of proximal and distal chromatin sites regulated by TFs. b, MAGICAL framework. To identify disease-associated regulatory circuits in a selected cell type (including ATAC assay cells and RNA assay cells from samples being compared), MAGICAL selects DAS as candidate chromatin sites (peaks) and DEG as candidate genes. Then, the filtered ATAC data and RNA data of DAS and DEG are used as input to a hierarchical Bayesian framework pre-embedded with the prior TF motifs and TAD boundaries. The chromatin activity A is modeled as a linear combination of TF–peak binding confidence B and the hidden TF activity T, with data noise contamination NA. The gene expression R is modeled as a linear combination of B, T, and peak–gene looping confidence L, with data noise contamination NR. MAGICAL estimates the posterior probabilities P(B|A,T), P(T|A,B), and P(L|R,B,T) by iteratively sampling variables B, T, and L to optimize against the data noise NA and NR in both modalities. Finally, regulatory circuits with high posterior probabilities of B and L (for example, a high confidence circuit with inferred interactions between TF1, Site2, and Gene1) are selected. c, Results validation. We evaluate the accuracy and cell-type specificity of the inferred peak–gene looping interactions by checking their enrichment with cell-type-matched chromatin interactions in Hi-C experiments. For the identified TFs, chromatin sites, and genes in circuits, we checked the accuracy of each using independent ChIP-seq, scATAC-seq, and scRNA-seq data. Finally, as a demonstration of the utility of MAGICAL, we used the circuit target genes as features to predict disease states.

To accurately identify varying circuits between different conditions, MAGICAL explicitly models signal and noise in chromatin accessibility and gene expression data (see Methods section ‘MAGICAL’). A TF–peak binding variable and a hidden TF activity variable are jointly estimated to fit to the chromatin accessibility variation across cells from the conditions being compared. These two variables are then used together with a peak–gene looping variable to fit the gene expression variation. Using Gibbs sampling, MAGICAL iteratively estimates variable values and optimizes the states of circuit TF–peak–gene linkages. Finally, high-confidence circuits fitting the signal variation in both data types are selected.

TF activity represents the regulatory capacity (protein level) of a particular TF protein16,17, which is distinct from TF expression. For each TF, we assume its hidden TF activities following an identical distribution across cells in the same cell type and the same sample, regardless of whether the cells are from the scATAC-seq assay, the scRNA-seq assay, or both. MAGICAL iteratively learns the activity distribution for each TF and estimates the specific activities of all TFs in each cell (Supplementary Fig. 1). This procedure eliminates the requirement of cell-level pairing of RNA-seq and ATAC-seq data. Thus, MAGICAL is a general tool that can analyze single-cell true multiome or sample-paired multiomics datasets.

We validated MAGICAL in multiple ways, demonstrating that it infers regulatory circuits accurately (Fig. 1c). MAGICAL-inferred linkages between chromatin sites and genes were validated using experimental 3D chromatin interactions. The resulting circuit genes, sites, and their regulatory TFs were evaluated in multiple independent studies. And finally, as one example of utility, we showed that the circuit genes can be used as features to classify disease states, providing a bioinformatics solution to challenging diagnostic problems.

Comparative analysis of performance

MAGICAL is a scalable framework. It can infer regulatory circuits of TFs, chromatin sites, and genes with differential activities between contrast conditions or infer regulatory circuits with active chromatin sites and genes in a single condition. Because existing integrative methods11,12,18 can only be applied to single-condition data, to provide a comparative assessment of the performance of MAGICAL, we restricted MAGICAL to the single-condition data analysis possible with existing methods.

For peak–gene looping inference, we compared MAGICAL to the TRIPOD11 and FigR18 methods, using the same benchmark single-cell multiome datasets as used by the authors reporting these methods. In the comparison of MAGICAL with TRIPOD using a 10x multiome single-cell dataset, MAGICAL-inferred peak–gene loops showed significantly higher enrichment of experimentally observed chromatin interactions in blood cells in the 4DGenome database19 (P < 0.0001, two-sided Fisher’s exact test, Supplementary Fig. 2a), the same validation data used by TRIPOD developers. MAGICAL also significantly outperformed FigR on the application to a GM12878 SHARE-seq dataset10. In that case, the peak–gene loops in MAGICAL-selected circuits had significantly higher enrichment of H3K27ac-centric chromatin interactions20 than did FigR (P < 0.0001, two-sided Fisher’s exact test, Supplementary Fig. 2b).

Because the MAGICAL framework, unlike TRIPOD and FigR, used chromatin TAD boundaries as prior information, we evaluated whether the improvement in performance resulted solely from this additional information. To investigate this, we eliminated the use of TAD boundaries and modified MAGICAL for this test by assigning candidate linkages between peaks and genes within 500 kb (a naive distance prior). As shown in Supplementary Fig. 2a,b, even without the prior TAD information, MAGICAL still outperformed the competing methods (P < 0.001, two-sided Fisher’s exact test). Overall, these results suggest that in addition to the benefit of priors, explicit modeling of signal and noise in both chromatin accessibility and gene expression data increased the accuracy of peak–gene looping identification.

MAGICAL analysis of COVID-19 single-cell multiomics data

To demonstrate the accuracy of the primary application of MAGICAL on contrast-condition data to infer disease-modulated circuits, we applied MAGICAL to sample-paired PBMC scRNA-seq and scATAC-seq data from individuals infected with SARS-CoV-2 and healthy controls5. Because immune responses in patients with COVID-19 differ according to disease severity21,22, MAGICAL inferred the regulatory circuits for mild and severe clinical groups separately. The chromatin sites and genes in the identified circuits were validated using newly generated and publicly available independent COVID-19 single-cell datasets (Fig. 2a). We primarily focused on three cell types that have been found to show widespread gene expression and chromatin accessibility changes in response to SARS-CoV-2 infection23,24, including CD8 effector memory T (TEM) cells, CD14 monocytes (Mono), and natural killer (NK) cells. In total, MAGICAL identified 1,489 high-confidence circuits (1,404 sites and 391 genes) in these cell types for mild and severe clinical groups (Supplementary Table 1; Methods section ‘MAGICAL analysis of COVID-19 single-cell multiomics data’).

Fig. 2: Validation of COVID-19-associated circuit chromatin sites and genes.
figure 2

a, We applied MAGICAL to a COVID-19 PBMC single-cell multiomics dataset and identified circuits for the clinical mild and severe groups. We validated the MAGICAL-selected circuit sites and genes using newly generated and independent COVID-19 single-cell datasets. b, UMAPs of a newly generated independent scATAC-seq dataset including 16,000 cells from six people with COVID-19 and 9,000 cells from three controls showed chromatin accessibility changes in CD8 TEM, CD14 Mono, and NK cell types. c,d, The precision of MAGICAL-selected circuit sites is significantly higher than that of the original DAS, the nearest DAS to DEG, or all DAS in the same TAD with DEG. e,f, The precision of circuit genes are significantly higher than that of DEG. c,e, For mild COVID-19, MAGICAL identified 645 sites in CD8 TEM, 599 sites in CD14 Mono, and 148 sites in NK, regulating 153 genes, 183 genes, and 60 genes, respectively. d,f, For severe COVID-19, MAGICAL identified 78 sites, 202 sites, and 62 sites in the three cell types, regulating 25 genes, 81 genes, and 26 genes, respectively. cf, Precision is defined as the proportion of the selected sites and genes to be differentially accessible and differentially expressed in the same cell type between infection and control conditions in independent datasets. Results are presented as bar plots where the heights represent the precision and the error bars represent the 95% confidence interval. Significance is evaluated using a two-sided Fisher’s exact test and P values between bars are shown.

Source data

To confirm the circuit chromatin sites selected by MAGICAL for mild COVID-19, we generated an independent PBMC scATAC-seq dataset from six people infected with SARS-CoV-2 with mild symptoms and three uninfected (polymerase chain reaction (PCR)-negative) controls (Fig. 2b; Supplementary Table 2). Approximately 25,000 quality cells were selected after quality-control (QC) analysis. These cells were integrated, clustered, and annotated using ArchR25 (Supplementary Fig. 3; Supplementary Table 3). Peaks were called from each cell type using MACS226. In total, 284,909 peaks were identified (Supplementary Table 4). For the three selected cell types, differential analysis between COVID-19 and control groups returned 3,061 sites for CD8 TEM, 1,301 sites for CD14 Mono, and 1,778 sites for NK (Supplementary Table 5; Methods section ‘COVID-19 PBMC scATACseq data analysis’). This produced three validation peak sets for mild COVID-19 infection. For severe COVID-19, an existing study focused on T cells identified specific chromatin activity changes with severe COVID-19 in CD8 T cells27. We used their reported chromatin sites to validate the circuit chromatin sites identified in CD8 T cells. In all four validation sets, the precision (proportion of sites that are differential in the validation data) of the MAGICAL-selected chromatin sites was significantly higher than the original DAS (P < 0.001, two-sided Fisher’s exact test, Fig. 2c,d).

When multiple potential chromatin regulatory loci are identified in the vicinity of a specific gene, it is commonly assumed that the locus closest to the transcriptional starting site (TSS) is likely to be the most important regulatory site. Challenging this assumption, however, are experimental studies that show genes may not be regulated by the nearest region28,29. Supporting the importance of more distal regulatory loci, MAGICAL-selected chromatin sites significantly outperformed the nearest DAS to the TSS of DEG or all DAS within the same TAD with DEG, and the improvement is substantial (precision is approximately 50% better with MAGICAL, P < 0.05, two-sided Fisher’s exact test, Fig. 2c,d).

To validate the circuit genes modulated by mild or severe COVID-19, we used genes reported by external COVID-19 single-cell studies21,30,31. In total, we collected six validation gene sets (three cell types for mild COVID-19 and three cell types for severe COVID-19). The precision of MAGICAL-selected circuit genes is significantly higher than that of original DEG in all validations (precision is approximately 30% better with MAGICAL, P < 0.05, two-sided Fisher’s exact test, Fig. 2e,f). These results confirmed the increased accuracy of disease association for both chromatin sites and genes in MAGICAL-identified regulatory circuits.

MAGICAL analysis of S. aureus single-cell multiomics data

We applied MAGICAL to the clinically important challenge of distinguishing MRSA and MSSA infections32,33,34. We profiled sample-paired scRNA-seq and scATAC-seq data using human PBMCs from adults whose blood cultures were positive for S. aureus (10 MRSA and 11 MSSA), and from 23 uninfected control subjects (Fig. 3a; Supplementary Table 6). To integrate scRNA-seq data from all samples, we implemented a Seurat-based35 batch correction and cell type annotation pipeline (Methods section ‘S.aureus scRNA-seq data analysis’). In total, 276,200 quality cells were selected and labeled (Fig. 3b; Supplementary Fig. 4; Supplementary Table 7). For scATAC-seq data, we integrated the fragment files from quality samples using ArchR25 and selected and annotated 70,174 quality cells (Fig. 3c; Supplementary Fig. 5; Supplementary Table 8). In total, 388,860 peaks were identified (Supplementary Fig. 5b; Supplementary Table 9; Methods section ‘S. aureus scATAC-seq data analysis’). In total, 13 major cell types that surpassed the 200-cell threshold in both scRNA-seq and scATAC-seq data were selected for subsequent analysis (Supplementary Fig. 6). Differential analysis for three contrasts (MRSA versus control, MSSA versus control, and MRSA versus MSSA) in each cell type returned a total of 1,477 DEG and 23,434 DAS (Supplementary Fig. 7; Supplementary Tables 10 and 11).

Fig. 3: MAGICAL accurately identified distal regulatory chromatin sites and epi-driven genes associated with S.aureus infection.
figure 3

a, We collected PBMC samples from 10 subjects infected with MRSA, 11 with MSSA, and 23 uninfected control subjects and generated sample-paired scRNA-seq and scATAC-seq data using separate assays. b, UMAP of integrated scRNA-seq data with 18 PBMC cell subtypes. c, UMAP of integrated scATAC-seq data with 13 PBMC cell subtypes. Under-represented subtypes including cDC1, CD4 TEM, CD8 CTL, pDC, and Plasmablast (representing less than 5% of cells in the scRNA-seq data in total), were not recovered from the scATAC-seq data. d, The number of MAGICAL-identified regulatory circuits in contrast analysis for each cell type. e, The number of shared and specific circuits between cell types. f, Enrichment of circuit peak–gene interactions in each cell type with cell-type-specific pcHi-C interactions. gi, We specifically analyzed MAGICAL-identified regulatory circuits for CD14 Mono. g, TF motif enrichment analysis in circuit sites showed that AP-1 proteins are mostly significantly enriched at chromatin regions with increased accessibility in the infection condition. The log2(FC) is calculated for each TF by dividing the number of binding sites with increased chromatin activity in the infection condition by the number of sites with decreased activity. h, In total, 633 circuit sites were identified by MAGICAL. Compared with all accessible chromatin sites, an increased proportion of circuit sites were in the range of 15 kb to 25 kb relative to gene TSS. In the curve, the center points represent the FC between the proportions of circuit sites and background sites at each location. The upper and lower points represent the 95% confidence interval. i, The circuit genes were significantly enriched with experimentally confirmed epigenetically driven genes (epi-genes) in monocytes. All significance was assessed using adjusted P values from a one-sided hypergeometric test.

Source data

MAGICAL identified 1,513 high-confidence regulatory circuits (1,179 sites and 371 genes) within cell types for three contrasts (MRSA versus control, MSSA versus control, and MRSA versus MSSA) (Supplementary Table 12; Methods: MAGICAL analysis of S.aureus single-cell multiomics data). It has been reported that activation of CD14 Mono plays a principal role in response to S.aureus infection36,37. In MAGICAL analysis, CD14 Mono showed the highest number of regulatory circuits (Fig. 3d). Comparing circuits between cell types we found that these disease-associated circuits are cell-type-specific (Fig. 3e). For example, circuits rarely overlapped between very distinct cell types like monocytes and T cells. Between relevant cell types like CD14 Mono and CD16 Mono, or between subtypes of T cells, most circuits are still specific for one cell subtype. These circuits were further validated using cell type-specific chromatin interactions reported in a reference promoter capture (pc) Hi-C dataset13. In all the cell types for which the cell-type-specific pcHi-C data was available (B cells, CD4 T cells, CD8 T cells, CD14 Mono), the circuit peak–gene interactions showed significant enrichment of pcHi-C interactions in matched cell types (Fig. 3f; P < 0.01, one-sided hypergeometric test). For comparison, we also performed the peak–gene interaction enrichment analysis between different cell types, finding significantly lower enrichment levels. These results indicate the cell-type specificity of MAGICAL-identified circuits.

In CD14 Mono, MAGICAL identified AP-1 complex proteins as the most important regulators, especially at chromatin sites with increased activity in cells exposed to infections (Fig. 3g). This finding is consistent with the importance of this protein complex in gene regulation in response to a variety of infections5,38,39. Supporting the accuracy of the identified TFs, we compared circuit chromatin sites with ChIP-seq peaks from the Cistrome database40. The most similar TF ChIP-seq profiles were from AP-1 complex proteins JUN and FOS in blood or bone marrow samples (Supplementary Fig. 8). Moreover, functional enrichment analysis41 of the circuit genes showed that cytokine signaling, a known pathway mediated by AP-1 complex and associated with the inflammatory responses in macrophages42,43, was the most enriched (adjusted P = 2.4 × 10–11, one-sided hypergeometric test).

MAGICAL modeled regulatory effects of both proximal and distal sites on genes. We examined the chromatin site location relative to the target gene TSS, for the identified circuits in CD14 Mono. Compared to all ATAC peaks called around the circuit genes, a substantially increased proportion of circuit chromatin sites were located 15Kb to 25Kb away from the TSS (Fig. 3h). This pattern is consistent with the 24Kb median enhancer distance found by CRISPR-based perturbation in a blood cell line44. In addition, nearly 50% of circuit chromatin sites were overlapping with enhancer-like regions in the ENCODE database45, further emphasizing that MAGICAL circuits are enriched in distal regulatory loci. We also found that these circuit chromatin sites were significantly enriched in inflammatory-associated genomic loci reported in the genome-wide association studies (GWAS) catalog database46, suggesting active host epigenetic responses to infectious diseases (Supplementary Fig. 9; P < 0.005 when compared to control diseases, two-sided Wilcoxon rank sum test). Notably, one distal chromatin site (hg38 chr6: 32,484,007–32,484,507) looping to gene HLA-DRB1 is within the most significant GWAS region (hg38 chr6: 32,431,410–32,576,834) previously reported to associate with S.aureus infection47.

We finally compared circuit genes to existing epi-genes whose transcriptions were significantly driven by epigenetic perturbations in CD14 Mono48. MAGICAL-identified circuit genes were significantly enriched with epi-genes (Fig. 3i; adjusted P < 0.005, one-sided hypergeometric test) while the remaining DEG not selected by MAGICAL, or DEG mappable with DAS either within the same topological domains or closest to each other showed no evidence of being epigenetically driven. These results suggest that MAGICAL accurately identified regulatory circuits activated in response to S.aureus infection.

S. aureus infection prediction

Early diagnosis of S.aureus infection and the strain’s antibiotic sensitivity is critical to choosing appropriate treatment for this life-threatening condition. We first evaluated whether the MAGIC-identified circuit genes that are common to MRSA and MSSA infections could provide a robust signature for predicting the diagnosis of S.aureus infection in general. Within each cell type, we selected circuit genes common to both the MRSA versus control and MSSA versus control analyses, resulting in 152 genes (Fig. 4a; Supplementary Table 12). To evaluate the prediction accuracy of these molecular features on S.aureus infection, we collected external, public expression data of S.aureus infected subjects. In total, we found one adult whole-blood49 and two pediatric PBMC bulk microarray datasets50,51 that comprised a total of 126 subjects infected with S.aureus and 68 uninfected controls. The use of pediatric validation data has the advantage of providing a much more rigorous test of the robustness of MAGICAL-identified circuit genes for classifying disease samples in this very different cohort.

Fig. 4: MAGICAL-identified circuit genes robustly predict S.aureus infection and bacteria antibody sensitivity.
figure 4

a, Circuit genes in common to MRSA and MSSA infections achieved a near-perfect classification of S.aureus infected and uninfected samples in multiple independent datasets (one adult dataset and two pediatric datasets). b, Circuit genes that differed between MRSA and MSSA showed predictive value of antibiotic sensitivity in independent patient samples (three pediatric datasets).

Source data

To allow validation using public bulk transcriptome datasets, we refined the 152 circuit genes set by selecting those with robust performance in our dataset at pseudobulk level. We calculated an area under the receiver operating characteristic curve (AUROC) for each circuit gene by classifying S.aureus infected and control subjects using pseudobulk gene expression (aggregated from the discovery scRNA-seq data). In total, 117 circuit genes with AUROCs greater than 0.7 were selected (Supplementary Table 13; Supplementary Fig. 10a). Functional gene enrichment analysis showed that interleukin (IL)-17 signaling was significantly enriched (adjusted P = 2.4 × 10–4, one-sided hypergeometric test), including genes from the AP-1, Hsp90, and S100 families. IL-17 has been found to be essential for the host defense against cutaneous S.aureus infection in mouse models52. We trained a support vector machine (SVM) model using the selected circuit genes as features and the discovery pseudobulk gene expression data as input. We then applied the trained SVM model to each of the three validation datasets. The model achieved high prediction performance on all datasets, with AUROCs from 0.93 to 0.98 (Fig. 4a).

This generalizability of circuit genes for predicting infection in different cohorts suggested that MAGICAL identifies regulatory processes that are fundamental to the host response to S.aureus sepsis. We further evaluated this by comparing the 117 circuit genes to 366 filtered DEG (with per gene AUROC > 0.7 in the discovery pseudobulk gene expression data). We examined the differential expression π value53 (a statistic score that combines both fold change (FC) and P-values) of genes in the validation datasets and found significantly higher π values for the circuit genes (Supplementary Fig. 10b; P = 9.0 × 10–3, one-sided Wilcoxon rank sum test).

S. aureus antibiotic sensitivity prediction

We then addressed the more challenging problem of predicting strain antibiotic sensitivity among S. aureus infected subjects. When we tested the predictive models trained with DEG for the contrast of MRSA and MSSA on three pediatric PBMC microarray datasets50,51,54 (comprising a total of 66 MRSA and 45 MSSA samples), we did not find predictive value (median of prediction areas under the curve (AUCs) close to 0.5; Supplementary Fig. 10d–f). And in all tests, the statistical difference of DEG-based prediction scores between MRSA and MSSA samples in the validation datasets was never significant. These results suggest that using host scRNA-seq data alone fails to identify robust molecular features for predicting the antibiotic sensitivity of the infected strain. Our observation echoes previous studies showing that in some challenging cases, differential expression analysis using RNA-seq data had limited power to identify robust features for disease-control sample classification55.

With MAGICAL we identified 53 circuit genes from the comparative multiomics data analysis between MRSA and MSSA (Supplementary Table 14). A model trained using 32 circuit genes that were robustly differential in the discovery pseudobulk data (per gene discovery AUROC > 0.7, Supplementary Fig. 10c) best distinguished antibiotic-resistant and antibiotic-sensitive samples in all three validation datasets, with AUROCs from 0.67 to 0.75 (Fig. 4b). And the statistical difference between prediction scores of MRSA and MSSA samples was significant (P = 9.2 × 10–3, two-sided Wilcoxon rank sum test). The success of the circuit-gene-based model demonstrated that MAGICAL captured generalizable regulatory differences in the host immune response to these closely related bacterial infections.

Discussion

MAGICAL addressed the previously unmet need of identifying differential regulatory circuits based on single-cell multiomics data from contrast conditions. Critically, it identifies regulatory circuits involving distal chromatin sites. The previously difficult-to-predict distal regulatory sites are increasingly recognized as key for understanding gene regulatory mechanisms. As MAGICAL uses DAS and DEG called from a pre-selected cell type, for less distinct cell types or conditions, it is harder for MAGICAL to infer circuits at cell-type resolution as there will be few candidate peaks and genes. Also, MAGICAL analyzes each cell type separately, and cell-type specificity is not directly modeled for disease circuit identification. Incorporating an approach to directly identify cell-type-specific circuits regulated in disease conditions would be valuable. In future work, we hope to extend the MAGICAL framework to improve circuit identification when cell types are poorly defined and to model cell-type specificity.

Methods

Human participants

The COVID-19 study protocol was approved by the Naval Medical Research Center institutional review board (protocol number NMRC.2020.0006) in compliance with all applicable federal regulations governing the protection of human subjects. The S. aureus sepsis study protocol was reviewed and approved by the Duke Medical School institutional review board (protocol number Pro00102421). Subjects provided written informed consent prior to participation.

Statistics and reproducibility

No statistical methods were used to pre-determine sample sizes. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

S. aureus patient and control samples selection

Patients with culture-confirmed S.aureus bloodstream infection transferred to Duke University Medical Center were eligible if pathogen speciation and antibiotic susceptibilities were confirmed by the Duke Clinical Microbiology Laboratory. DNA and RNA samples, PBMCs, clinical data, and the bacterial isolate from the subject were cataloged using an institutional review board-approved notification of decedent research. We excluded samples with prior enrollment of the patient in this investigation (to ensure statistical independence of observations) or they were polymicrobial (that is, more than one organism in blood or urine culture). In total, 21 adult patients were selected (10 MRSA and 11 MSSA). None of them received any antibiotics in the 24 h before the bloodstream infection. Control samples were obtained from uninfected healthy adults matching the sample number and age range of the patient group. In total, 23 samples were collected from two cohorts: 14 controls (provided by Weill Cornell Medicine, New York, NY), and 9 controls (provided by the Battelle Memorial Institute, Columbus, OH). Meta information of the selected subjects are provided in Supplementary Table 6.

PBMC thawing

Frozen PBMC vials were thawed in a water bath at 37 °C for 1–2 minutes and placed on ice. Roswell Park Memorial Institute (RPMI) medium with 20% fetal bovine serum (FBS) (500 μl) was added dropwise to the thawed vial, the content was aspirated and added dropwise to 9 ml of RPMI/20% FBS. The tube was gently inverted to mix, before being centrifuged at 300g for 5 min. After removal of the supernatant, the pellet was resuspended in 1–5 ml of RPMI/10% FBS depending on the size of the pellet. Cell count and viability were assessed with Trypan Blue on a Countess II cell counter (Invitrogen).

S. aureus scRNA-seq data generation

ScRNA-seq was performed (10x Genomics, Pleasanton, CA), following the Single Cell 3′ Reagents Kits V3.1 User Guidelines. Cells were filtered, counted on a Countess instrument, and resuspended at a concentration of 1,000 cells μl–1. The number of cells loaded on the chip was determined based on the 10x Genomics protocol. The 10x chip (Chromium Single Cell 3′ Chip kit G PN-200177) was loaded to target 5,000–10,000 final cells. Reverse transcription was performed in the emulsion and complementary DNA was amplified following the Chromium protocol. Quality control and quantification of the amplified cDNA were assessed on a Bioanalyzer (High-Sensitivity DNA Bioanalyzer kit) and the library was constructed. Each library was tagged with a different index for multiplexing (Chromium i7 Multiplex Single Index Plate T Set A, PN-2000240) and quality controlled using a Bioanalyzer prior to sequencing.

S. aureus scRNA-seq data analysis

Reads of scRNA-seq experiments were aligned to human reference genome (hg38) using 10x Genomics Cell Ranger software (version 1.2). The filtered feature-by-barcode count matrices were then processed using Seurat35. Quality cells were selected as those with more than 400 features (transcripts), fewer than 5,000 features, and less than 10% of mitochondrial content (Supplementary Fig. 4; Supplementary Table 7). Cell cycle phase scores were calculated using the canonical markers for G2M and S phases embedded in the Seurat package. Finally, the effects of mitochondrial reads and cell cycle heterogeneity were regressed out using SCTransform.

To integrate cells from heterogeneous disease samples, we first built a reference by integrating and annotating cells from the uninfected control samples using a Seurat-based pipeline. For batch correction, we identified the intrinsic batch variants and used Seurat to integrate cells together with the inferred batch labels. All control samples were integrated into one harmonized query matrix. Each cell was assigned a cell-type label by referring to a reference PBMC single cell dataset. The cell-type label of each cell cluster was determined by most cell labels in each. Canonical markers were used to refine the cell-type label assignment. This integrated control object was used as reference to map the infected samples.

To avoid artificially removing the biological variance between each infected sample during batch correction, we computationally predicted and manually refined cell types for each sample. All infection samples were projected onto the UMAP (Uniform Manifold Approximation and Projection) of the control object for visualization purpose. In total, 276,200 high-quality cells and 19 cell types with at least 200 cells in each were selected for the subsequent analysis. Within each cell type, DEG between contrast conditions were first called using the Findmarkers function of the Seurat V4 package35 with default parameters. DEG with Wilcoxon test false discovery rate (FDR) < 0.05, |log2(FC)| >0.1 and actively expressed in at least 10% of cells (pct > 0.1) from either condition were selected. To correct potential bias caused by the different sequencing depth between samples, we ran DEseq256 on the aggregated pseudobulk gene expression data. Refined DEG passing pseudobulk differential statistics P < 0.05 and |log2(FC)| >0.3 were selected as the final DEG (Supplementary Table 10).

Nuclei isolation for scATACseq

Thawed PBMCs were washed using phosphate-buffered saline (PBS) with 0.04% bovine serum albumin (BSA). Cells were counted and 100,000–1,000,000 cells were added to a 2 ml microcentrifuge tube. Cells were centrifuged at 300g for 5 min at 4 °C. The supernatant carefully completely removed, and 0.1X lysis buffer (1x: 10 mM Tris-HCl pH 7.5, 10 mM NaCl, 3 mM MgCl2, nuclease-free H2O, 0.1% v/v NP-40, 0.1% v/v Tween-20, 0.01% v/v digitonin) was added. After 3 min incubation on ice, 1 ml of chilled wash buffer was added. The nuclei were pelleted at 500g for 5 min at 4 °C and resuspended in a chilled diluted nuclei buffer (10x Genomics) for scATAC-seq. Nuclei were counted and the concentration was adjusted to run the assay.

S. aureus scATAC-seq data generation

ScATAC-seq was performed immediately after nuclei isolation and following the Chromium Single Cell ATAC Reagent Kits V1.1 User Guide (10x Genomics, Pleasanton, CA). Transposition was performed in 10 μl at 37 °C for 60 min on at least 1,000 nuclei, before loading of the Chromium Chip H (PN-2000180). Barcoding was performed in the Gel Bead-in-Emulsion (GEMs) (12 cycles) following the Chromium protocol. After post-GEM cleanup, libraries were prepared following the protocol and were indexed for multiplexing (Chromium i7 Sample Index N, Set A kit PN-3000427). Each library was assessed on a Bioanalyzer (High-Sensitivity DNA Bioanalyzer kit).

S. aureus scATAC-seq data analysis

Reads of scATAC-seq experiments were aligned to human reference genome (hg38) using 10x Genomics Cell Ranger software (version 1.2). The resulting fragment files were processed using ArchR25. Quality cells were selected as those with TSS enrichment >12, the number of fragments >3,000 and <30,000, and nucleosome ratio <2 (Supplementary Fig. 5a; Supplementary Table 8). The likelihood of doublet cells was computationally assessed using the addDoubletScores function and cells were filtered using the filterDoublets function with default settings. Cells passing quality and doublet filters from each sample were combined into a linear dimensionality reduction using the addIterativeLSI function with the input of the tile matrix (read counts in binned 500 bp across the whole genome) with iterations = 2 and varFeatures = 20,000. This dimensionality reduction was then corrected for batch effect using the Harmony method57, via the addHarmony function. The cells were then clustered based on the batch-corrected dimensions using the addClusters function. We annotated scATAC-seq cells using the addGeneIntegrationMatrix function, referring to a labeled multimodal PBMC single cell dataset. Doublet clusters containing a mixture of many cell types were manually identified and removed. In total, 70,174 high-quality cells and 13 cell types with at least 200 cells in each were selected.

Peaks were called for each cell type using the addReproduciblePeakSet function with the MACS2 peak caller26 (Supplementary Fig. 5b). In total, 388,859 peaks were identified (Supplementary Table 9). Within each cell type, differentially accessible chromatin sites (DAS) between contrast conditions (MRSA versus control, MSSA versus control or MRSA versus MSSA) were called from the single cell chromatin accessibility count data using the getMarkerFeatures function25, with parameter settings as testMethod = wilcoxon, bias = log10(nFrags), normBy = ReadsInPeaks, and maxCells = 15,000. Peaks with single cell differential statistics FDR < 0.05, |log2(FC)| >0.1, and actively accessible in at least 10% of cells (pct > 0.1) from either condition were selected as DAS. Owing to the high false positive rate in single-cell-based differential analysis58, we further refined the DAS by fitting a linear model to the aggregated and normalized pseudobulk chromatin accessibility data and tested DAS individually about their covariance with sample conditions56. Refined DAS passing pseudobulk differential statistics P < 0.05 and |log2(FC)| >0.3 between the contrast conditions were selected as the final DAS (Supplementary Table 11).

MAGICAL

To build candidate regulatory circuits, TFs were mapped to the candidate chromatin sites by searching for human TF motifs from the chromVARmotifs library59 using the addMotifAnnotations function (ArchR). The TF binding sites were then linked with the candidate genes by requiring them in the same TAD within boundaries. Then, a candidate circuit is constructed with a chromatin site and a gene in the same domain, with at least one TF binding at the site.

For each cell type (that is, the i-th cell type), MAGICAL inferred the confidence of TF–peak binding and peak–gene looping in each candidate circuit using a hierarchical Bayesian framework with two models: a model of TF–peak binding confidence (B) and hidden TF activity (T) to fit chromatin accessibility (A) for M TFs and P chromatin sites in KA,S,i cells with scATAC-seq measures from S samples; a second model of peak–gene interaction (L) and the refined (noise removed) regulatory region activity (BT) to fit gene expression (R) of G genes in KR,S,i cells with scRNA-seq measures from the same S samples.

$${{{A}}}_{P\times {K}_{A,S},i}={{{B}}}_{P\times M,i}{{{T}}}_{M\times {K}_{A,S},{i}}+{{{N}}}_{P\times {K}_{A,S},{i}},$$
(1)
$${{{R}}}_{G\times {K}_{R,S},{i}}={{{L}}}_{G\times P,i}{{{B}}}_{P\times M,i}{{{T}}}_{M\times {K}_{R,S},{i}}+{{{N}}}_{G\times {K}_{R,S},{i}},$$
(2)

\({{{A}}}_{P\times {K}_{A,S},i}\) was a P by KA,S,i matrix with each element \({a}_{p,{k}_{A,s},i}\) representing the ATAC read count of p-th chromatin site (ATAC peak) in the kA,s-th cell in the s-th sample.

\({{{R}}}_{G\times {K}_{R,S},i}\) was a G by KR,S,i matrix with each element \({r}_{g,{k}_{R,s},i}\) representing the RNA read count of g-th gene in the kR,s-th cell of the s-th sample.

\({{{N}}}_{P\times {K}_{A,S},{i}}\) and \({{{N}}}_{G\times {K}_{R,S},{i}}\) represented data noise corresponding to \({{{A}}}_{P\times {K}_{A,S},{i}}\) and \({{{R}}}_{G\times {K}_{R,S},{i}}\).

\({{{B}}}_{P\times M,i}\) was a P by M matrix with each element \({b}_{p,m,i}\) representing the binding confidence of the m-th TF on the p-th candidate chromatin site.

\({{{L}}}_{G\times P,i}\) was a G by P matrix with each element \({l}_{p,g,i}\) representing the interaction between the p-th chromatin site and the g-th gene.

\({{{T}}}_{M\times {K}_{A,S},{i}}\) was an M by KA,S ,i matrix with each element \(\,{t}_{m,{k}_{A,s},i}\) representing the hidden TF activity of the m-th TF in the \({k}_{A,s}\)-th ATAC cell of s-th sample.

\({{{T}}}_{M\times {K}_{R,S},{i}}\) was an M by KR,S ,i matrix with each element \(\,{t}_{m,{k}_{R,s},i}\) representing the hidden TF activity of the m-th TF in the \({k}_{R,s}\)-th RNA cell of s-th sample.

\({{{T}}}_{M\times {K}_{A,S},{i}}\) and \({{{T}}}_{M\times {K}_{R,S},{i}}\) were both extended from the same \({{{T}}}_{M\times S,{i}}\) (with elements \({t}_{m,s,i}\)) by assuming that in the i-th cell type and the s-th sample, the m-th TF’s regulatory activities in all ATAC cells and all RNA cells followed an identical distribution of a single variable \({t}_{m,s,i}\). Therefore, \({K}_{A,S,i}\) and \({K}_{R,S,i}\) can be different numbers and MAGICAL will only estimate the matrix \({{{T}}}_{M\times S,{i}}\).

To select high-confidence regulatory circuits, MAGICAL estimated the confidence (probability) of TF–peak binding \({{{B}}}_{P\times M,i}\) and peak–gene interaction \({{{L}}}_{G\times P,i}\) together with the hidden variable \({{{T}}}_{M\times S,{i}}\) in a Bayesian framework.

$$P\left({{B}},{{T}},{{L}}{\rm{|}}{{A}},{{R}}\right)\propto P({{R}}{\rm{|}}{{L}},{{B}},{{T}}\,)P({{A}}{\rm{|}}{{B}},{{T}}\,)P({{L}})P({{B}})P\left(T\,\right).$$
(3)

Based on the regulatory relationship among chromatin sites, upstream TFs, and downstream genes (as illustrated in Fig. 1), the posterior probability of each variable can be approximated as:

$$P\left({{T}}{{|}}{{A}}{{,}}{{B}}\right)\propto P({{A}}{{|}}{{B}}{{,}}{{T}}\,)P\left({{T}}\,\right),$$
(4)
$$P\left({{B}}{{|}}{{A}}{{,}}{{T}}\,\right)\propto P({{A}}{{|}}{{B}}{{,}}{{T}}\,)P\left({{B}}\right),$$
(5)
$$P\left({{L}}{\rm{|}}{{R}},{{B}},{{T}}\,\right)\propto P\left({{R}}{\rm{|}}{{L}},{{B}},{{T}}\,\right)P({{L}}).$$
(6)

Although the prior states of \({b}_{p,m,i}\) and \({l}_{p,g,i}\) were obtained from the prior information of TF motif–peak mapping and topological-domain-based peak–gene pairing, their values were unknown. We assumed zero-mean Gaussian priors for B, L and the hidden variable T by assuming that positive regulation and negative regulation would have the same priors, which is likely to be true given the fact that there were usually similar numbers of upregulated and downregulated peaks and genes after the differential analysis. We set a high variance (non-informative) in each prior distribution to allow the algorithm to learn the distributions from the input data.

$${b}_{p,m,i} \sim\mathrm{normal}(\,{\mu }_{B},{\sigma }_{B}^{2}),$$
(7)
$${t}_{m,s,i} \sim\mathrm{normal}(\,{\mu }_{T},{\sigma }_{T}^{2}),$$
(8)
$${l}_{p,g,i} \sim\mathrm{normal}(\,{\mu }_{L},{\sigma }_{L}^{2}).$$
(9)

where \((\,{\mu }_{B},{\sigma }_{B}^{2})\), \((\,{\mu }_{T},{\sigma }_{T}^{2})\), and \((\,{\mu }_{L},{\sigma }_{L}^{2})\) are hyperparameters representing the prior mean and variance of TF–peak binding, TF activity, and peak–gene looping variables.

The likelihood functions \(P({{A|B}}{{,}}{{T}}\,)\) and \(P\left({{R|L}}{{,}}{{B}}{{,}}{{T}}\,\right)\) represent the fitting performance of the estimated variables to the input data. These two conditional probabilities are equal to the probabilities of the fitting residues \({{{N}}}_{P\times {K}_{A,S},{i}}\) and \({{{N}}}_{G\times {K}_{R,S},{i}}\), for which we assumed zero-mean Gaussian distributions.

$${{A}}|{{B}}{{,}}{{T}} \sim\mathrm{normal}({\mu }_{{N}_{A}},{\sigma }_{{N}_{A}}^{2}),\,{\sigma }_{{N}_{A}}^{2} \sim\mathrm{{inverse}\,{gamma}}({\alpha }_{{N}_{A}},\,{\beta }_{{N}_{A}}),$$
(10)
$${{R}}|{{L}}{{,}}{{B}}{{,}}{{T}} \sim\mathrm{normal}({\mu }_{{N}_{R}},{\sigma }_{{N}_{R}}^{2}),{\sigma }_{{N}_{R}}^{2} \sim\mathrm{{inverse}\,{gamma}}({\alpha }_{{N}_{R}},\,{\beta }_{{N}_{R}}),$$
(11)

where \(({\mu }_{{N}_{A}},{\sigma }_{{N}_{A}}^{2})\) and \(({\mu }_{{N}_{R}},{\sigma }_{{N}_{R}}^{2})\) are hyperparameters representing the prior mean and variance of data noise in the ATAC and RNA measures. Here, the variance of the signal noise is modeled using inverse gamma distributions, with hyperparameters \(({\alpha }_{{N}_{A}},\,{\beta }_{{N}_{A}})\) and \(({\alpha }_{{N}_{R}},\,{\beta }_{{N}_{R}})\) to control the variance of fitting residues (very low probabilities on large variances).

Then, the posterior probability of each variable defined in equations (4)–(6) was still a Gaussian distribution with poster mean \(\hat{\mu }\) and variance \(\hat{\sigma }\) as shown below:

$${\hat{b}}_{p,m,i} \sim\mathrm{normal}(\,{\hat{\mu }}_{B,m,i},{{{\hat{\sigma }}_{B,m,i}}}^{2}),$$
(12)
$${\hat{t}}_{m,s,i} \sim\mathrm{normal}(\,{\hat{\mu }}_{T,m,s,i},{{\hat{\sigma }}_{T,m,s,i}}^{2}),$$
(13)
$${\hat{l}}_{p,g,i} \sim\mathrm{normal}(\,{\hat{\mu }}_{L,i},{{\hat{\sigma }}_{L,i}}^{2}).$$
(14)

Gibbs sampling was used to iteratively learn the posterior distribution mean and variance of each set of variables and draw samples of their values accordingly.

For the TF–peak binding events, the posterior mean \({\hat{\mu }}_{B,{m},i}\) and variance \({{\hat{\sigma }}_{B,m,i}}^{2}\) were estimated specifically for m-th TF since the number of binding sites and the positive or negative regulatory effects between TFs could be very different.

$$\begin{array}{l}{\hat{\mu }}_{B,m,i}={\frac{{\sum }_{s}{\sum }_{k}{t}_{m,s,i}\left({a}_{p,{ks},i}-{\sum} _{{m\prime}}{b}_{p,{m\prime},i}{t}_{{m\prime},s,i}\right){\sigma }_{B}^{2}+{\mu }_{B,t}{\sigma }_{{N}_{A}}^{2}}{{\sum }_{s}{K}_{A,s}{t}_{m,s,i}^{2}{\sigma }_{B}^{2}+{\sigma }_{{N}_{A}}^{2}}}\,{\mathrm{and}}\\{{\hat{\sigma}}_{B,m,i}}^{2}={\frac{{\sigma }_{{N}_{A}}^{2}{\sigma }_{B}^{2}}{{\sum }_{s}{K}_{A,s}{t}_{m,s,i}^{2}{\sigma }_{B}^{2}+{\sigma }_{{N}_{A}}^{2}}}.\end{array}$$
(15)

For TF activities, the posterior mean \({\hat{\mu }}_{T,m,s,i}\) and variance \({{\hat{\sigma }}_{T,m,s,i}}^{2}\) were estimated specifically for the m-th TF and s-th sample using chromatin accessibility data as follows:

$$\begin{array}{l}{\hat{\mu }}_{T,m,s,i}={\frac{{\sum }_{p}{\sum }_{k}{b}_{p,m}\left({a}_{p,k,s,i}-{\sum} _{m{\prime} }{b}_{p,m{\prime} }{t}_{m{\prime} ,s}\right){\sigma }_{T}^{2}+{\mu }_{T}{\sigma }_{{N}_{A}}^{2}}{{\sum }_{p}{K}_{A,s}{{b}_{p,m,i}^{2}{{\sigma }_{T}^{2}}+{\sigma }_{{N}_{A}}^{2}}}}\,{\mathrm{and}}\\{{\hat{\sigma }}_{T,m,s,i}}^{2}=\frac{{\sigma }_{{N}_{A}}^{2}{\sigma }_{T}^{2}}{{{\sum }_{p}{K}_{A,s}{{b}_{p,m,i}^{2}}{\sigma }_{T}^{2}}+{{\sigma}_{{N}_{A}}^{2}}}.\end{array}$$
(16)

Then, based on the estimated distribution parameters of \({\hat{\mu }}_{T,m,s,i}\) and \({{\hat{\sigma }}_{T,m,s,i}}^2\) of \({\hat{t}}_{m,s,i}\), for the \({k}_{R,s}\)-th RNA cell in the same s-th sample we draw a TF regulatory activity sample as \({\hat{t}}_{m,{k}_{R},s,i}\). For p-th peak, we were able to reconstruct its chromatin activity in the RNA cell as \({\hat{a}}_{p,{k}_{R},s,i}={\sum }_{m}{\hat{b}}_{p,m,i}{\hat{t}}_{m,{k}_{R},s,i}\), and for g-th gene, we further estimated the interaction confidence \({\hat{l}}_{p,g,i}\) between p-th peak and g-th gene. The peak–gene interaction distribution parameters \({\hat{\mu }}_{L,i}\) and \({{\hat{\sigma }}_{L,i}}^{2}\) were estimated as follows:

$$\begin{array}{l}{\hat{\mu }}_{L,i}=\frac{{\sum }_{s}{\sum }_{k}{\hat{a}}_{p,{k}_{R},s,i}\left({r}_{g,k,s,i}-{\sum }_{{p\prime}}{l}_{g,{p\prime}}{\hat{a}}_{{p\prime},{k}_{R},s,i}\right){\sigma }_{L}^{2}+{\mu }_{L}{\sigma }_{{N}_{R}}^{2}}{{\sum }_{s}{\sum }_{{k}_{R,s}}{({\hat{a}}_{p,{k}_{R},s,i})}^{2}{\sigma }_{L}^{2}+{\sigma }_{{N}_{R}}^{2}}\,{\rm{and}}\\{{\hat{\sigma }}_{L}}^{2}=\frac{{\sigma }_{{N}_{R}}^{2}{\sigma }_{L}^{2}}{{\sum }_{s}{\sum }_{{k}_{R,s}}{({{\hat{a}}_{p,{k}_{R},s,i})}}^{2}{\sigma }_{L}^{2}+{\sigma }_{{N}_{R}}^{2}}.\end{array}$$
(17)

In n-th round of Gibbs estimation, after learning all distributions, we estimated the confidence of each linkage by linearly mapping the sampled values of \({\hat{b}}_{p,m,i}\) and \({\hat{l}}_{p,g,i}\) in the range of (–∞,∞) to probabilities in (0,1) as follows:

$$\begin{array}{l}P\left(\mathrm{state}\left({b}_{p,m,i}{\rm{|}}n\right)=1\right)\\=\frac{{{\exp }}\left\{\left({\hat{b}}_{p,m,i}-{\hat{\mu }}_{B,m,i}\right)/2{{\hat{\sigma }}_{B,m,i}}^{2}\right\}}{{{\exp }}\left\{\left({\hat{b}}_{p,m,i}-{\hat{\mu }}_{B,m,i}\right)/2{{\hat{\sigma }}_{B,m,i}}^{2}\right\}+{{\exp }}\left\{\left(0-{\hat{\mu }}_{B,m,i}\right)/2{{\hat{\sigma }}_{B,m,i}}^{2}\right\}}.\end{array}$$
(18)
$$\begin{array}{l}P\left(\mathrm{state}\left({l}_{p,g,i}{\rm{|}}n\right)=1\right)\\=\frac{{{\exp }}\left\{\left({\hat{l}}_{p,g,i}-{\hat{\mu }}_{L,i}\right)/2{{\hat{\sigma }}_{L,i}}^{2}\right\}}{{{\exp }}\left\{\left({\hat{l}}_{p,g,i}-{\hat{\mu }}_{L,i}\right)/2{{\hat{\sigma }}_{L,i}}^{2}\right\}+{{\exp }}\left\{\left(0-{\hat{\mu }}_{L,i}\right)/2{{\hat{\sigma }}_{L,i}}^{2}\right\}}.\end{array}$$
(19)

Binary state samples were then drawn based on the confidence of each linkage and were then used to initiate the next round of estimations. After running a long sampling process (in total, N rounds) and accumulating enough samples on the binary states of TF–peak bindings and peak–gene interactions, we calculated the sampling frequency of each linkage as a posterior probability.

$$\left\{\begin{array}{c}P\left(\mathrm{state}\left({b}_{p,m,i}\right)=1\right)=\frac{{\sum }_{n}\mathrm{state}\left({b}_{p,m,i}{\rm{|}}n\right)}{N}\\ P\left(\mathrm{state}\left({l}_{p,g,i}\right)=1\right)=\frac{{\sum }_{n}\mathrm{state}\left({l}_{p,g,i}{\rm{|}}n\right)}{N}\end{array}\right.$$
(20)

MAGICAL analysis of S. aureus single-cell multiomics data

For each cell type, given DAS and DEG of contrast conditions (MRSA versus control, MSSA versus control or MRSA versus MSSA), MAGICAL was first initialized by mapping prior TF motifs from the chromVARmotifs library to DAS using addMotifAnnotations (ArchR). Because there is no PBMC cell type Hi-C data publicly available, we are using TAD boundaries from a lymphoblastoid cell line, GM12878, which was originally generated by EBV transformation of PBMCs60. The TAD boundary structure is closely conserved between the lymphoblastoid cell lines and primary PBMC61 and between cell types62,63. We called TAD boundaries from a GM12878 cell line Hi-C profile63 using TopDom64. About 6,000 topological domains were identified. For each contrast, we built candidate circuits by pairing DAS with TF binding sites with DEG in the same domain. MAGICAL was run 10,000 times to ensure that the sampling process converged to stable states. This process was repeated for all cell types and the top 10% high confidence circuit predictions were selected from each cell type for validation analysis.

MAGICAL analysis of COVID-19 single-cell multiomics data

As a proof of concept for contrast condition, single-cell multiomics data analysis, MAGICAL was applied to a public PBMC COVID-19 single-cell multiomics dataset5 with samples collected from patients with different severity and heathy controls. For each of the three selected cell subtypes (CD8 TEM, CD14 Mono, and NK), from the original publication we downloaded DEG for two contrasts: mild versus control and severe versus control. For each of the selected cell types, DAS were called respectively for mild versus control and severe versus control using ArchR functions and thresholds as introduced in the paper. MAGICAL was initialized by mapping TF motifs from the chromVARmotifs library to DAS using addMotifAnnotations (ArchR). As explained above, we used TAD boundary information of ~6,000 domains identified in the GM12878 cell line63 as prior to pair DAS with TF binding sites and DEG. Then, the initial candidate regulatory circuits were constructed. Respectively for mild and severe COVID-19, MAGICAL was run 10,000 times to ensure that the sampling process converged to stable states. This process was repeated for the three selected cell types. The chromatin sites and genes in the top 10% predicted high confidence circuits in each cell type were selected as disease associated.

COVID-19 PBMC samples of validation scATAC-seq data

To validate chromatin sites associated with mild COVID-19, PBMC samples were obtained from the COVID-19 Health Action Response for Marines (CHARM) cohort study, which has been previously described65. The cohort is composed of Marine recruits who arrived at Marine Corps Recruit Depot—Parris Island for basic training between May and November 2020, after undergoing two quarantine periods (first a home quarantine, and next a supervised quarantine starting at enrollment in the CHARM study) to reduce the possibility of SARS-CoV-2 infection at arrival. Participants were regularly screened for SARS-CoV-2 infection during basic training by PCR, serum samples were obtained using serum separator tubes at all visits, and a follow-up symptom questionnaire was administered. At selected visits, blood was collected in BD Vacutainer CPT Tube with Sodium Heparin and PBMC were isolated following the manufacturer’s recommendations. We used PBMC samples from six participants (five males and one female) who had a positive COVID-19 PCR test and had mild symptoms (sampled 3–11 days after the first PCR positive test), and from three control participants (three males) who had a PCR negative test at the time of sample collection and were seronegative for SARS-CoV-2 immunoglobulin G. New scATAC-seq data were generated following the same protocol as described in "S. aureus scATAC-seq data generation" (Supplementary Table 2).

COVID-19 PBMC scATACseq data analysis

Reads of scATAC-seq experiments were aligned to human reference genome (hg38) using 10x Genomics Cell Ranger software (version 1.2). The resulting fragment files were processed using ArchR25. Quality cells were selected as those with TSS enrichment >12, a number of fragments >3,000 and <30,000, and a nucleosome ratio <2. The likelihood of doublet cells was computationally assessed using the addDoubletScores function and cells were filtered using the filterDoublets function with default settings. A total of 15,836 high-quality cells in the infection group and 9,125 cells in the control group were selected after QC analysis (Supplementary Fig. 3; Supplementary Table 3). These cells were combined into a linear dimensionality reduction using the addIterativeLSI function with the input of the tile matrix (read counts in binned 500 bp across the whole genome) with iterations = 2 and varFeatures = 20,000. The cells were then clustered using the addClusters function. We annotated scATAC-seq cells using the addGeneIntegrationMatrix function, referring to a labeled multimodal PBMC single cell dataset. Doublet clusters containing a mixture of many cell types were manually identified and removed.

Peaks were called for each cell type using the addReproduciblePeakSet function with peak caller MACS226 (Supplementary Fig. 3). In total, 284,525 peaks were identified (Supplementary Table 4). For each of the three selected cell types (CD8 TEM, CD14 Mono, and NK), chromatin sites with single cell differential statistics FDR < 0.05 and |log2(FC)| >0.1 between COVID-19 and control conditions and actively accessible in at least 10% of cells (pct > 0.1) from either condition were selected. Refined peaks passing pseudobulk differential statistics P < 0.05 and |log2(FC)| >0.3 between the contrast conditions were finally selected as the validation peak set (Supplementary Table 5).

COVID-19 circuit peaks and genes accuracy evaluation

The number of infection-associated peaks/genes reported by each COVID-19 study would be different owing to the difference in the number of recruited patients and collected cells. To overcome the issue caused by the imbalanced number between discovery and validation datasets or between differential peaks/genes and circuit sites/genes, in each comparison, the larger peak/gene set was randomly downsampled to match the smaller number of peaks/genes in the other set. The precision (site reproduction rate) is calculated to assess the accuracy of each peak/gene set.

MAGICAL analysis of 10x PBMC single-cell true multiome data

For benchmarking, MAGICAL was applied to a 10x PBMC single-cell multiome dataset including 108,377 ATAC peaks, 36,601 genes, and 11,909 cells from 14 cell types. MAGICAL used the same candidate peaks and genes as selected by TRIPOD11 for fair performance comparison. Two different priors were used to pair candidate peaks and genes: (1) the peaks and genes were within the same TAD from the GM12878 cell line; (2) the centers of peaks and the TSS of genes were within 500 kbp. MAGICAL inferred regulatory circuits with each prior and used the top 10% of predictions for accuracy assessment. High-confidence peak–gene interactions predicted by TRIPOD on the same data were directly downloaded from the supplementary tables of their publication11. Two baseline approaches of peak–gene pairing were included: pairing all peaks with a gene if they are in the same TAD or pairing only the nearest peak to a gene based on their genomic distance. To fairly assess the accuracy of MAGICAL weighted peak–gene interactions and the results (paired or non-paired) from TRIPOD or baseline approaches, we selected the top 10% of predictions by MAGICAL as the final peak–gene pairing. We overlapped these pairs with the curated 3D genome interactions in blood context from the 4DGenome database19 and calculated the precision for each approach.

MAGICAL analysis of GM12878 cell line SHARE-seq data

For benchmarking, MAGICAL was also applied to a GM12878 cell line SHARE-seq dataset10. For fair comparison, MAGICAL used the same candidate peaks and genes as selected by FigR18. MAGICAL was initialized with two different priors to pair candidate peaks and genes: (1) the peaks and genes were within the same prior TAD from the GM12878 cell line; (2) the centers of peaks and the TSS of genes were within 500k bps. MAGICAL inferred regulatory circuits under each setting and used the top 10% predictions for accuracy assessment. High-confidence peak–gene interactions predicted by FigR were directly downloaded from the supplementary tables of the original publication10. Similarly, the top 10% predictions by MAGICAL and interactions paired by the two baseline approaches mentioned above were selected. We overlapped peak–gene interactions predicted by each approach with GM12878 H3K27ac HiChIP chromatin interactions20 for precision evaluation.

Validating predicted peak–gene interactions

To assess the precision of the predicted circuit peak–gene interactions, we assumed a correctly inferred peak–gene pair should be also connected by a chromatin interaction reported by Hi-C or similar experiments. To check this, each peak was extended to 2 kb long and then checked for overlap with one end of a physical chromatin interaction. For genes, we checked whether the gene promoter (–2 kb to 500 b of TSS) overlapped the other end of the interaction. Precision was calculated as the proportion of overlapped chromatin interactions among the predicted peak–gene pairs. The significance of enrichment of overlapped chromatin interactions was assessed using hypergeometric P value, with all candidate peak–gene pairs as background.

GWAS enrichment analysis

To assess the enrichment of GWAS loci of inflammatory diseases in circuit chromatin sites in each cell type, significant GWAS loci were downloaded from GWAS catalog46 for inflammatory diseases and control diseases. GREGOR66 was used to assess the enrichment of GWAS loci at which either the index single nucleotide polymorphism (SNP) or at least one of its LD proxies overlaps with a circuit chromatin site, using pre-calculated LD data from 1,000 G EUR samples. The enrichment P value of each disease GWAS was converted to a z-score. With each cell type, enrichment scores for traits with fewer than five overlapped GWAS SNPs with circuit sites were hold out. Also, as all reference data used by GREGOR is hg19-based, genome coordinates of testing regions were mapped from hg38 to hg19.

Predicting S. aureus infection state

To refine circuit genes for predicting infection diagnosis in microarray gene expression data, the capability of each circuit gene on distinguishing infection and control samples, or MRSA and MSSA samples, was assessed using sample level pseudobulk gene expression data, aggregated from the discovery scRNA-seq datasets. The total number of reads of each sample was normalized to 1 × 107. The normalized RNA read counts across all samples were then log and z-score transformed. For each circuit gene, a discovery AUROC was calculated by comparing the scRNA-seq gene-expression-based sample ranking against the contrasted sample groups. Circuit genes were prioritized based on AUROCs. An SVM model was trained using the top-ranked circuit genes as features and their normalized pseudobulk expression data as input. The model was then tested on independent microarray datasets. The microarray gene expression data was also log and z-score transformed to ensure a similar distribution to the training data. For comparison, top DEG prioritized by discovery AUROC or by other approaches like the minimum redundancy maximum relevance algorithm or LASSO regression were also tested on the same microarray datasets.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.