Allele specific chromatin signals, 3D interactions, and motif predictions for immune and B cell related diseases

Several Genome Wide Association Studies (GWAS) have reported variants associated to immune diseases. However, the identified variants are rarely the drivers of the associations and the molecular mechanisms behind the genetic contributions remain poorly understood. ChIP-seq data for TFs and histone modifications provide snapshots of protein-DNA interactions allowing the identification of heterozygous SNPs showing significant allele specific signals (AS-SNPs). AS-SNPs can change a TF binding site resulting in altered gene regulation and are primary candidates to explain associations observed in GWAS and expression studies. We identified 17,293 unique AS-SNPs across 7 lymphoblastoid cell lines. In this set of cell lines we interrogated 85% of common genetic variants in the population for potential regulatory effect and we identified 237 AS-SNPs associated to immune GWAS traits and 714 to gene expression in B cells. To elucidate possible regulatory mechanisms we integrated long-range 3D interactions data to identify putative target genes and motif predictions to identify TFs whose binding may be affected by AS-SNPs yielding a collection of 173 AS-SNPs associated to gene expression and 60 to B cell related traits. We present a systems strategy to find functional gene regulatory variants, the TFs that bind differentially between alleles and novel strategies to detect the regulated genes.

6. The reads are counted for each allele at each heterozygous position (hzSNPs) for the two genomes.
6b. Reads are also counted across 7 different cell lines summing up the read counts at heterozygous (hz) positions (7LCLs) 7. Only hzSNPs with reads aligning on both alleles are retained.
8. Each hzSNP is filtered by region-maps from ChromHMM and tfNet. The retained hzSNPs fall in putative regions associated with enhancers, promoters, insulators regions, or a mix of these. 9. hzSNPs located in centromeres, telomeres and CNVs defined in GM12878 are excluded.
10. A binomial test is applied to determine hzSNPs which show a statistical significant difference between reads aligned to each allele, and the results were corrected for multiple testing with the Benjamini-Hochberg algorithm.
11. All hzSNPs with a binomial test p-value above 0.05 are removed together with hzSNPs with fewer than 10 reads on either allele.
12. The remaining hzSNPs, called AS-SNPs, are intersected with SNPs from the 1000 genomes project in order to retrieve the allele frequency (AF) for AS-SNPs and define common (AF ≥ 0.01) or rare (AF <0.01) AS-SNPs.
13. AS-SNPs in ENCODE blacklisted regions are removed.
14. AS-SNPs are assigned their RegulomeDB scores to help in the prioritization of putative functional variants.
15. Common and rare AS-SNPs are intersected with collections of eQTL and GWAS SNPs B cell specific and SNPs in high LD (r2 > 0.8) with these.
16. The rare AS-SNPs are also intersected with genomic windows of 300 bp around common AS-SNPs. Rare AS-SNPs harbored in the same elements of common AS-SNPs are labelled 'extended' SNPs (see Table 2 in main text).
After the computations are completed, there are multiple scripts to describe the final results for each Excel tables with allow to intersect collections of AS-SNPs with data from any other study providing a .bed formatting.

High-throughput chromosome conformation capture with targeted sequence capture (HiCap)
Lymphoblastoid cells GM12878 were grown in a controlled environment of 37°C under 5% of carbon dioxide in RPMI-1640 growing medium with 2mM L-glutamine and 15% of fetal bovine serum. A blend of penicillin and streptomycin was used to prevent bacterial contamination.
High-throughput chromosome conformation capture followed by targeted sequence capture (HiCap) was performed on cross-linked GM12878 nuclei pellets as detailed below. The experiments were carried out in three biological replicates with approximate number of 5 million cells per experiment.
The cells were cross-linked with 1% of formaldehyde for 10 minutes followed by cell lysis and nuclei isolation. The chromatin was then solubilized by sodium dodecyl sulfate (SDS) dilution and enzymatically digested with 1μL/μg of fast digest MboI (↓GATC; Thermo Fisher Scientific) for 4.5 hours at 37°C. SDS was quenched with Triton X-100 before the enzymatic reaction. Biotin labeling of digested DNA ends was employed for later enrichment of the aimed ligation product. A mix containing biotin-14-dATP with the reaction catalyzed by Klenow fragment of DNA Polymerase I filled the protruding 5' overhangs created by the restriction enzyme. The enzymatic activities were quenched by a brief incubation at 75°C in a presence of 10mM of EDTA. Consequently the material was ligated with 12 Weiss units of T4 DNA ligase (New England Biolabs) for 4.5 hours at 16°C favoring intra-molecular blunt end ligation of cross-linked fragments. The ligation product was then purified using phenol-chloroform-isoamyl alcohol (25:24:1) followed by precipitation with sodium acetate pH 5.2 and absolute ethanol. RNA contamination was removed by RNase A treatment for 1 hour at 37°C. Subsequently the purified chimeric DNA was treated with T4 DNA Polymerase to remove the unligated ends containing biotin mark (reaction artifacts). Resulting product was then sheared for 6 cycles of 60 seconds using the sonication system by Covaris Inc. with these settings:   AS-SNP rs12032504 and rs11811536 are in LD with the GWAS SNPs rs10911390 associated to Systemic lupus erythematosus (SLE). The AS-SNPs are located ~2kb apart in a genomic region with a multi loop TAD architecture defined by HiC data (purple interactions) that narrows the possible target genes to five candidates: NCF2, SMG7, ARPC5, APOBEC4 and RGL1. HiCap data (green interactions) suggest that the regulatory elements harboring the two AS-SNPs interact with the promoters of two of the 5 genes: APOBEC4 and RGL1. The distal probe region (~5kb) is highlighted in gray. Histone modifications tracks for H3K4me1, H3K4me3 and H3K27ac were retrieved from the ENCODE project for the B cell line GM12878 (scaled using vertical viewing range settings) as well as the DNaseI hypersensitive clusters (DHS). The sequence logo for the TF binding motifs of RUNX3 and STAT5B have been determined from PMMs to overlap rs11811536 while rs12032504 overlaps motifs for TFs NDF2 and Z280D (not shown) Figure. S4 Different nature of the AS signals obtained using raw reads from ChIP-seq experiments (a). AS-SNP defined by ChIP-seq reads of a TF. (b). AS-SNPs defined by ChIP-seq reads of histone modifications for short regulatory elements (up to ~200 bp). (c). AS-SNPs defined by ChIP-seq reads of histone modifications for longer regulatory elements (> 300 bp). Using shorter reads or investigating longer elements like promoters will not provide a complete coverage of the regulatory element and in this case the defined AS-SNPs would either represent the real functional variant (c top) or be the echo of an AS event happening elsewhere in the regulatory element (c bottom).