Deconvolution of single-cell multi-omics layers reveals regulatory heterogeneity

Integrative analysis of multi-omics layers at single cell level is critical for accurate dissection of cell-to-cell variation within certain cell populations. Here we report scCAT-seq, a technique for simultaneously assaying chromatin accessibility and the transcriptome within the same single cell. We show that the combined single cell signatures enable accurate construction of regulatory relationships between cis-regulatory elements and the target genes at single-cell resolution, providing a new dimension of features that helps direct discovery of regulatory patterns specific to distinct cell identities. Moreover, we generate the first single cell integrated map of chromatin accessibility and transcriptome in early embryos and demonstrate the robustness of scCAT-seq in the precise dissection of master transcription factors in cells of distinct states. The ability to obtain these two layers of omics data will help provide more accurate definitions of “single cell state” and enable the deconvolution of regulatory heterogeneity from complex cell populations.

repression of cell type specific genes 15 . Single-cell ATAC-seq and RNA-seq represent a 23 great opportunity to study how TFs and epigenomic features induce transcriptional 24 outcomes that influence cell fate determinations. For example, combined analyses of 25 datasets by these two approaches have enabled characterization of subtypes in mouse 26 tissues 16 or during human hematopoietic differentiation 17 . However, it still remains 27 challenging to integrate the two approaches experimentally in individual cells, thus 28 hampering a full understanding of regulatory association between these two layers. Here, 29 we present scCAT-seq (single-cell chromain accessibility and transcriptome sequencing), 30 a technique that integrates single-cell ATAC-seq and RNA-seq to measure chromatin 31 accessibility (CA) and gene expression (GE) simultaneously in single cells. scCAT-seq 32 employs a mild lysis approach and a physical dissociation strategy to separate the nucleus 33 and cytoplasm of each single cell. Thereafter, the supernatant cytoplasm component is 34 subjected to the Smart-seq2 method as described previously 7 . The precipitated nucleus is 35 3 then subjected to a Tn5 transposase-based and carrier DNA-mediated protocol to amplify 1 the fragments within accessible regions ( Fig. 1a and Supplementary Methods). Beyond 2 parallel CA and GE profiling in the same single cell, scCAT-seq will be particularly useful for 3 analyzing samples when the amount of input material is limited. 4 5

Results 6
Simultaneous profiling of accessible chromatin and gene expression in single cells. 7 We applied scCAT-seq to the K562 chronic myelogenous leukemia cell line, which has 8 been widely used in the ENCODE project. We sorted single cell and multi-cell samples (e.g., 9 500 cells) into wells of 96-well plates using flow cytometry. Empty wells were used as 10 negative control. Samples were then processed using the scCAT-seq protocol. qPCR 11 analysis confirmed the successful capture of single cell nuclei during library preparation 12 ( Supplementary Fig. 1a). We generated combined CA and GE profiles from a total of 192 13 For scCAT-seq-generated CA data, we obtained an average of 2.1 x 10 5 uniquely mapped, 17 usable fragments from single cells (Supplementary Table 1 and Supplementary Fig.  18 1c,d). Similar to bulk ATAC-seq 18 , the CA fragments show fragment-size periodicity 19 corresponding to integer multiples of nucleosomes (Supplementary Fig. 1e) and are 20 strongly enriched on accessible regions ( Fig. 1b and . 1f) which is largely reduced in comparison to standard bulk ATAC-seq studies (typically 23 over 30%) 18 . Pearson correlation analyses revealed our single-cell profiles could reproduce 24 features of bulk profiles (Supplementary Fig. 1g). In comparison to the published scATAC-25 seq profiles by Buenrostro et al. 10 , we obtained a higher number of usable fragments per 26 single cell but with lower signal-to-noise ratio (Supplementary Fig. 1h). However, the 27 correlation between single cells increases remarkably (Supplementary Fig. 1h), suggesting 28 that scCAT-seq is able to capture the chromatin features more accurately. 29 30 For mRNA data generated by scCAT-seq, we obtained an average of 4.6 million reads 31 covering over 8000 genes (GENCODE v19, TPM > 1), which is comparable to published 32 expression levels of bivalent genes were remarkably lower than active genes and were 23 similar to those of inactive genes. We also investigated the distribution of CA fragments 24 across genomic contexts bound by different TFs and found an overall consistent pattern 25 between CA and GE level. Notably, we observed substantial decrease of expression levels 26 of genes associated with binding of EZH2 while the accessibility level showed just a 27 moderate change (Fig. 1e). This pattern is similar to that of bivalent genes and is consistent 28 with the role of EZH2 which, as part of the repressive polycomb complex, catalyzes 29 H3K27me3. Thus, the combined signatures from scCAT-seq well reflect known processes 30 well and are useful to assess the transcriptional state of genes within different genomic 31 contexts. This approach is undoubtedly of high value for many biological applications, for 32 example, studying the heterogeneous transition of bivalent genes during development or 33 cellular reprogramming. 34

35
We further validated our approach by generating different batches of scCAT-seq profiles 1 from two additional ENCODE cell lines: HeLa-S3 cervix adenocarcinoma and HCT116 2 colorectal carcinoma cell lines (Supplementary Table 1). To test the feasibility of scCAT-3 seq in real tissue samples, we also generated profiles from two lung cancer patient-derived 4 xenograft (PDX) models (Supplementary Table 1). One is derived from a moderately 5 differentiated squamous cell carcinoma patient (PDX1) and the other one from a large-cell 6 lung carcinoma patient (PDX2). Principal components analysis (PCA) on both CA and GE 7 profiles resulted in separation of cells from different origin (Supplementary Fig. 2a,b). A 8 comparison of our datasets with published profiles revealed that the differences across 9 protocols and batches had a substantially smaller effect than difference across cell types 10 ( Supplementary Fig. 2c,d). 11 12

Establishment of regulatory relationships between CREs and genes in single cells. 13
Next, we explored the dynamic associations between the two omics layers across single 14 cells. We first tested the correlation between accessibility level of single CREs and their 15 expression of the putative target genes in each of the three cell lines, and the hypothetical 16 cell population merged from them. As expected, we identified remarkably more positive 17 correlations (Pearson correlation > 0; FDR < 10%) than negative correlations 18 ( Supplementary Fig. 3a), which is consistent with the known relationship between CA and 19 GE in bulk profiles 21 . 20 21 An earlier study showed the co-variability of accessibility between CREs across single cells 22 defines regulatory domains highly concordant with observed chromosome compartments, 23 which provides an alternative approach to the discovery of regulatory links 10 . However, it 24 still remains impossible to directly infer the transcriptional outcomes of each chromatin 25 accessible region. Given the overall positive correlation between CA and GE, we reasoned 26 that the co-variability between accessibility of individual elements and expression of genes 27 could enhance discovery of regulatory links that influence transcription. To this end, while 28 employing the reported strategy using scATAC-seq 10 (strategy 1, Fig. 2a), we proposed two 29 additional strategies for inferring regulatory relationships (strategy 2 and 3, Fig. 2a). For 30 strategy 1 and 2, regulatory relationships between chromatin accessible regions and target 31 genes were identified based on scATAC-seq and scCAT-seq data, respectively. Based on 32 scATAC-seq data, regulatory relationships for every gene were assigned when the 33 Spearman correlation of the accessibility of CREs located at the promoter and distal peaks 34 was above 0.25 (strategy 1, Fig. 2a and Supplementary Methods). Likewise, for the 35 6 scCAT-seq data, the regulatory links were assigned if the Spearman correlation between 1 the GE and the accessibility of distal CREs was above 0.25 (strategy 2, Fig. 2a and  2 Supplementary Methods). However, these regulatory relationships are defined across all 3 cells. In order to more accurately depict the regulatory relationship between chromatin and 4 genes, in strategy 3, single-cell-specific regulatory relationships between genes and their 5 nearby accessible regions were assigned using the scCAT-seq data as follows: i) 6 identification of active TFs for every cell by SCENIC 22 using the normalized GE matrix; ii) 7 identification of active accessible regions by matching the binding motifs of active TFs to 8 accessible chromatin regions; and iii) assignment of regulatory relationships after applying 9 a Wilcoxon test to determine if the presence of a nearby active accessible region was 10 associated with a significant change in the target GE (p-value < 0.05) (Fig. 2a and  11 Supplementary Methods). 12 13 By applying the 3 strategies to single cells of the 3 cell lines, we found that strategy 3 14 identified the largest number of regulatory relationships (62,769), compared to strategy 1 15 (46,813) and strategy 2 (21,219) (Fig. 2b). Over 1/3 of the regulatory relationships from 16 scATAC-seq based method (strategy 1) were shared by those from scCAT-seq based 17 method (strategy 2 and 3), suggesting strong synergistic effects between regulation at 18 chromatin and transcriptome levels. Nevertheless, although a similar correlation approach 19 was used in strategies 1 and 2, strategy 2 identified a lower number of regulatory 20 relationships, suggesting a possible decoupling between accessibility at the promoter and 21 the expression of the gene. Notably, we also observed a large fraction of regulatory 22 relationships specifically identified by each method, which suggests that different 23 information can be obtained from single-omics and combined analysis. 24

25
To assess the accuracy of the regulatory links inferred by each method, we next counted 26 the regulatory relationships that could be verified by chromatin interaction analysis by 27 paired-end tag sequencing (ChIA-PET) 23 . Encouragingly, using the ChIA-PET interactions 28 of the three widely used cell types (K562, HeLa-S3 and HCT116) 24 , we observed higher 29 proportion of validations in scCAT-seq based method (strategy 2 and 3) than that in 30 scATAC-seq based method (strategy 1) in all 3 cell types (Fig. 2c). These suggest that the 31 co-variability between CA and GE layers could better reflect higher-order chromatin 32 structure than co-variability between CREs. One explanation is that regulatory relationships 33 inferred from scATAC-seq may result from either chromatin interactions or from co-binding 34 of master TFs without interaction, while those inferred from scCAT-seq could be considered 7 to be "functional" regulatory relationships as include information from both chromatin 1 interactions and co-binding of master TFs. Therefore, based on the largest number of 2 validated regulatory relationships, strategy 3 outperformed the other strategies (hereafter, 3 the "regulatory relationship" indicates those identified only by strategy 3). The distribution of 4 distance between each pair of peak and gene in all regulatory relationships showed higher 5 enrichment in proximal regions than distal regions ( Supplementary Fig. 3b), suggesting 6 that GE tends to be regulated by proximal elements which is consistent with earlier 7 findings 25 . 8 9 To assess whether the regulatory relationships in each single cell reflect cell type-specific 10 features, we generated a binary matrix where columns represent single cells, and rows 11 represent all identified regulatory relationships between accessible sites and genes, and the 12 entries indicate the on or off state of each regulatory relationship in each cell. We applied a 13 non-negative matrix factorization (NMF) method, implemented in the R package Bratwurst 26 , 14 to decompose the matrix into different signatures that could distinguish single cell identities. 15 As expected, NMF clustering of the regulatory relationships identified signatures containing 16 numerous cell type-specific regulatory relationships, resulting in clear separation of the 3 17 cell types (Fig. 2d,e and Supplementary Fig. 3c). For example, SAMSN1 is a known 18 oncogene, preferentially expressed in the blood cancer, multiple myeloma 27 . We observed 19 highly specific regulatory relationships around SAMSN1 in K562, a myelogenous leukemia 20 cell line (Fig. 2e), revealing a strong association between its expression and accessibility of 21 CREs. This observation again reconfirmed the importance of epigenetic mechanisms during 22 progression of tumors. Likewise, we generated regulatory relationship matrix for single cells 23 from PDX tissues and clustering of the matrix clearly separated these two type of cells (Fig.  24 2f,g and Supplementary Fig. 3d). Interestingly, we also observed a subpopulation of cells 25 showing specific regulatory relationships in PDX2 (Fig. 2f,g), likely reflecting the regulatory 26 heterogeneity present in real tissues. 27 28 Integrated single-cell epigenome and transcriptome maps of human pre-implantation 29

embryos. 30
We next explored the potential of scCAT-seq in the characterization of single cell 31 identities in continuous developmental processes. The human pre-implantation embryo 32 development is a fascinating time that involves dramatic changes in both chromatin state 33 and transcriptional activity. However, it has only been investigated at either the chromatin 34 or the RNA level due to the lack of truly integrative approaches 28 . By using clinically 35 discarded human embryos (Supplementary methods), we generated scCAT-seq profiles 1 for a total of 110 individual cells, and successfully obtained 29 quality-filtered profiles from 2 morula stage and 43 from blastocyst stage (success rate 65.5%) (Fig. 3a, Supplementary  3 Table 1). To explore the regulation relevant to each stage, we 4 identified ~100K regulatory relationships and generated a matrix of regulatory relationships 5 across all single cells as described above. NMF clustering analysis of the matrix showed 6 separation of all single cells into two main groups (group 1 and 2), corresponding to these 7 two stages (Fig. 3b). The heatmap of exposure scores to each signature revealed activation 8 of regulatory relationships of pluripotency markers (such as NANOG and KLF17) in morula, 9

Fig. 4a and Supplementary
and trophectoderm (TE) markers (such as CDX2 and GATA3) in blastocyst stage 28 (Fig.  10 3b,c and Supplementary Fig. 4b,c), which strongly suggests that the expression of these 11 markers is activated/maintained by epigenomic states 28 . 12

13
The transition between cell fates largely depends on TFs, which bind to CREs and recruit 14 chromatin modifiers to reconfigure chromatin structure 15 . Single-cell chromatin accessibility 15 data provides a great opportunity to find the key TFs in individual cells 10, 17 . However, TFs 16 of the same family often share similar motifs, which makes it difficult to determine the key 17 TFs of functional specificity. Previous efforts have proposed computational algorithms to 18 integrate CA and GE data, but the accuracy remains uncertain because the analyses are 19 based on separate multi-omics datasets 16,17 . 20

21
We reasoned that functionally relevant master TFs in each cell type should be determined 22 by integrated omics data obtained by scCAT-seq. We applied chromVAR 29 , a method for 23 inferring TF accessibility with single cell CA data, to compute the deviations of known TFs 24 across all single cells. This method identified TF motifs with high variances (Supplementary 25 Fig. 4d), dividing all single cells into two main groups ( Supplementary Fig. 4e), in 26 agreement with the clustering results on regulatory relationships (Fig. 3b). We observed 27 that motifs from the POU-Homebox, SOX-HMG and KLF-zf families showed high deviation 28 score in cells of the group 1, while motifs from GATA-zf and GRHL-CP2 families showed a 29 high deviation score in cells of the group 2 (Fig. 3d). To determine the master TF from each 30 family, we next integrated the expression level of these TFs. Interestingly, we found that the 31 well-known pluripotency factors (such as NANOG, POU5F1, SOX2, KLF4, TBX4), as well 32 as early markers (such as KLF17), both showed relatively high levels of CA and GE in cells 33 of the group 1, whereas other TFs of the same families (such as POU3F1, SOX5, KLF7 and 34 TBX1) showed opposite trends (Fig. 3d). These results are highly consistent with the 35 9 features of the pluripotent morula cells, which are the main component of group 1. We also 1 found GATA3, but not GATA4 and GATA6, to show a specific role in the group 2, which 2 contains cells from the blastocyst stage. This is in agreement with the important role of 3 GATA3 during differentiation of trophoblast 30 . In addition, we also observed similar results 4 from other TFs of the same families, such as SOX9, HOXD4, MEF2C and GRHL1, 5 suggesting they likely playing critical roles in these two groups (Fig. 3d). Overall, these 6 results suggest that our integrated method could increase the power of discovery of 7 functionally relevant TFs at single-cell resolution. showed a comparable small proportion using the known, lineage-specific markers NANOG 19 (EPI), SOX17 (PE) (Fig. 3e). 20

21
We next sought to validate the ICM-like cells by molecular features based on their two omics 22 signatures. It is known that OCT4 is initially expressed in all cells within the ICM, and 23 becomes restricted to the EPI in the late blastocyst 31 . Interestingly, although OCT4 was not 24 a general marker of the blastocyst stage (Fig. 3d), it has a higher deviation score in the 3 25 single cells compared to other cells in the blastocyst (Supplementary Fig. 4f). Notably, 2 26 of them (#504 and #539) showed even higher deviations from the other single cell (#522) 27 ( Supplementary Fig. 4f), which may describe the segregation into EPI (#504 and #539) 28 and PE (#522) lineages (hereafter termed "EPI-like" and "PE-like" cells). 29

30
We next attempted to support this hypothesis by identifying the key TFs in the EPI-or PE-31 like cells. Encouragingly, in addition to enrichment of OCT4, we also observed specific 32 enrichment of the well-known EPI specific regulators, such as NANOG, and KLF17, in EPI-33 like cells (Fig. 3f), while the PE-like cell showed high activity of the well-known PE 34 regulators, such as SOX17, HNF1B and FOXA2 (Fig. 3f). The other members of the same 35 families (such as SOX9, FOXA1 and HNF1A) are not likely to be the key regulators because 1 of the inconsistent patterns of CA and GE. Further supporting this conclusion, the well-2 known non-TF markers were also found to be highly specific to each cell type, including 3 GDF3, TGDF1, DPPA2, DPPA5, ARGFX in EPI-like cells and BMP2, PDGFRA, FN1, 4 COL4A1 and LINC00261 in PE-like cells 33 (Fig. 3f). Although the EPI-and PE-like cells are 5 similar to morula cells, the above markers tend to be transcriptionally active in EPI-or PE-6 like cells based on CA and GE profiles. (Supplementary Fig. 4g,h), suggesting distinct 7 pluripotent states in the morula and blastocyst stages. Taken together, these results indicate 8 that our integrated approach can faithfully identify the two distinct subtypes from the same                                  TF6  TF7  TF4  TF5  TF2  TF9  TF1  TF3  TF8  TF2  TF4  TF5  TF6  TF7  TF8   Match binding motifs of active  TFs to ATAC-seq peaks   Identify Blastocyst EPI-like (#504 and #539) PE-like (#522)   Gene expression