Genomic programming of IRF4-expressing human Langerhans cells

Langerhans cells (LC) can prime tolerogenic as well as immunogenic responses in skin, but the genomic states and transcription factors (TF) regulating these context-specific responses are unclear. Bulk and single-cell transcriptional profiling demonstrates that human migratory LCs are robustly programmed for MHC-I and MHC-II antigen presentation. Chromatin analysis reveals enrichment of ETS-IRF and AP1-IRF composite regulatory elements in antigen-presentation genes, coinciding with expression of the TFs, PU.1, IRF4 and BATF3 but not IRF8. Migration of LCs from the epidermis is accompanied by upregulation of IRF4, antigen processing components and co-stimulatory molecules. TNF stimulation augments LC cross-presentation while attenuating IRF4 expression. CRISPR-mediated editing reveals IRF4 to positively regulate the LC activation programme, but repress NF2EL2 and NF-kB pathway genes that promote responsiveness to oxidative stress and inflammatory cytokines. Thus, IRF4-dependent genomic programming of human migratory LCs appears to enable LC maturation while attenuating excessive inflammatory and immunogenic responses in the epidermis.

probesets differentially regulated by TNF. Lines (edges) represent the similarity between transcripts, circles (nodes) represent genes. DGE: 1156, FDR<0.05, |LogFC|>1. 7 main clusters were identified (gene n>8), denoted by different colours. Enrichment in biological processes was done using ToppGene on-line tool significance denoted by FDR (Benjamini-Hochberg) corrected p-value is shown. d) Expression profiles in 7 main clusters (gene n>8) identified with transcript-to-transcript clustering, (BioLayout Express3D, r= 0.85; MCL = 1.7) of 1,156 probesets differentially regulated by TNF in human migrated LCs. Source data are provided as a Source Data file e) Gene expression of PSME2 and CAV1 in migrated LC assessed by qPCR (expression normalised to house-keeping gene YWHAZ (2-dCT) before (grey bars) and following stimulation with TNF (black bars) either during or post-migration. N=1-3 independent skin donors, in duplicate. Line denotes median. f) Heat map of genes included in antigen presentation class I signature from Reactome database. Log (2) FPKM median expression values are shown for each gene in the signature. The black line denotes expression cut-off for detection. RNA-seq n=3 independent donors. Source data are provided as a Source Data file g) A representative example of intracellular and surface protein expression assessed by flow cytometry in control LCs (red, 24h in medium) LC stimulated 24h with TNF (blue) vs isotype ctrl (grey)

Supplementary Figure 3. scRNA-seq analysis of migrated human LCs
950 single migrated epidermal cells highly enriched in LC were subjected to Drop-seq encapsulation and single cell RNA-sequencing Alignment, read filtering, barcode and UMI counting were performed using kallisto-bustools followed by clustering within the pythonbased Scanpy.
a) UMAP plot of cell cycle analysis indicating that while no cell cycle synchronisation was visible in cluster 1-4, cluster 5 contained exclusively cells in G2S phase (orange). b) Coverage of gene expression for 20 genes with the highest expression levels detected across all the cells. Percentage of total counts shown, bar indicates median with range for each gene. c) Violin plots representing expression of cluster defining markers identified with ScanPy analysis, Leiden r=0.2, across cell population. d) Violin plots representing expression of genes implicated in key biological processes enriched in subclusters of LCs. Genes involved in antigen presentation (left panel), genes involved in antigen processing (middle panel) and genes functioning in oxidative phosphorylation (right panel) are displayed. Each plot shows distribution of CPTT normalised expression values of indicated transcript in a given LC subpopulation. e) Projection of scRNAseq LC data onto a reference set of newly classified blood dendritic cells was carried out using Gene Set Variation Analysis (GSVA) based on enrichment score. The heatmap shows the GSVA enrichment score for each LC transcriptome.

Supplementary Figure 4. Chromatin landscape of migrated LCs enriches for EICE, AICE and ISRE motifs
a) Proportion of DEG with increase (M>0) H3K27Ac mark at 2h and 24h in clusters of coexpressed genes up-regulated early (2h, clusters 3) and late (24h, cluster 2) during stimulation with TNF. Changes in H3K27Ac acetylation were calculated using MANorm algorithm (MACS2, 1 ) embedded in BioWardrobe tool 2 and genes filtered to include unique common entry across the biological replicates (n=3 independent donors). Genes with detected changes in acetylation were intersected with DEGs identified by EdgeR analysis. Source data are provided as a Source Data file b) UCSC genome browser tracks of H3K27Ac histone mark for B2M from human LC genomic programme. Acetylated area in the promoter region shown by blue peaks along the gene track. c) UCSC genome browser tracks of H3K27Ac histone mark for PSME2 from human LC genomic programme. Acetylated area in the promoter region shown by blue peaks along the gene track.

Bulk RNA-seq data analysis
Quality control for FASTQ files with raw sequence data was done using FASTQC tool [FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc ] Adapter sequences and low quality reads were trimmed using Trimmomatic 3 . High-quality reads were mapped to the human genome (hg19) using TopHat (version 2.0.9, 4 ) and, following the removal of multimapping reads, converted to gene specific read counts for annotated genes using HTSeq-count (version 0.5.4) 5 . Raw counts from RNA-Seq were processed in Bioconductor package EdgeR 6 , variance was estimated and size factor normalized using TMM. Genes with minimum 2 reads at minimum 50% samples were included in the downstream analyses.
Gene ontology analysis was done using ToppGene on-line tool 8 .

scRNA-seq data analysis
Samples were de-multiplexed with bcl2fastq tool from Illumina. Alignment, read filtering, barcode and UMI counting were performed using kallisto-bustools 9 . All further analyses were run using the python-based Scanpy 10 except where stated otherwise. High quality barcodes were selected based on the overall UMI distribution using emptyDrops 11 , filtering criteria was adjusted to match estimated the number of true cells. To remove low quality cells, we filtered cells with a high fraction of counts from mitochondrial genes (20% or more) indicating stressed or dying cells. In addition, genes with expression in less than 5 cells were excluded.
Highly variable genes were selected using distribution criteria: min_mean=0.0125, max_mean=6, min_disp=0.6. A single-cell neighbourhood graph was computed on the first principal components that sufficiently explain the variation in the data using 10 nearest neighbours. Uniform Manifold Approximation and Projection (UMAP) was run for visualization. Leiden algorithm 12 was used to identify cell clustering within samples (Leiden r = 0.2, n_pcs=4, n_neighbours =10). Diffmap algorithm was used for pseudotrajectory analysis.
Differentially expressed genes were identified using linear model scDiffExlimma (SingleCellTK package in R 13 , FDR corrected p value <0.05 used as a cut-off criteria following scnorm normalisation 14 . Gene enrichment analysis for DEGs was done in ToppGene suit (FDR corrected p value <0.05 cut-off).
To project single cell LC transcriptomes onto known gene signatures from dendritic cell populations we used a set of marker genes from monocytes and DC from Villani et al. as a reference set 15 . Over-representation analysis was performed using the hypergeometric test using GSVA 16 . The results were plotted as a heat map using the Ward method with Euclidean distances.

ChIP-seq data analysis
ChIP-seq data analysis was performed using pipelines implemented in the BioWardrobe suite 2 BAM-formatted hg19 aligned ChIPseq reads were used for peak calling MACS2 (version 2.1.1. 1 was used to estimate fragment size, identify and annotate peaks. Changes in histone modification profiles were assessed with MAnorm algorithm implemented in BioWardrobe. ChiP-seq profiles at a distance -3000bp -+3000bp from TSS were generated using ChIPseeker package in R. Common regions with histone modification were identified using findoverlaps function, DiffBind package (R environment)

Transcription factor binding site motif enrichment analysis
We performed TF motif enrichment analysis on ChIP-seq peak sets using the HOMER software package 17 . HOMER calculates the statistical enrichment in a set of input DNA sequences (here, ChIP-seq peak sequences) for a large set of position weight matrix (PWM) TF binding models. Subsequently, putative binding sites for enriched motifs (EICE, AICE, and ISRE) were identified in the H3K27ac T0 dataset using custom scripts encoding the standard log likelihood PWM scoring function 18 . DNA sequences scoring at least 70% of the best possible loglikelihood score were recorded as putative binding sites.

Intersection of ChIP-seq peaks with public genomics datasets
To identify transcription factor binding events and other epigenetic marks that overlap with our ChIP-seq data, we used our Regulatory Element Locus Intersector (RELI) computational method 19 , which overlaps a set of genomic locations (e.g., peaks from a ChIP-seq experiment) with a large collection of functional genomics datasets. We created a library of ~5,000 datasets by compiling data (ChIP-seq for TFs and histone marks, DNase-seq, ATAC-seq, etc.) from a variety of sources, including ENCODE 20 , Cistrome 21 , PAZAR 22 , Re-Map 23 , and Roadmap Epigenomics 24 . As input, RELI takes a set of genomic loci (e.g., H3K27ac ChIP-seq peaks at T0). These loci are systematically intersected with each functional genomics dataset, and the number of input regions overlapping each dataset by at least one base are counted.
Next, a p-value describing the significance of this overlap is estimated using a simulation-