A functional genomics pipeline identifies pleiotropy and cross-tissue effects within obesity-associated GWAS loci

Genome-wide association studies (GWAS) have identified many disease-associated variants, yet mechanisms underlying these associations remain unclear. To understand obesity-associated variants, we generate gene regulatory annotations in adipocytes and hypothalamic neurons across cellular differentiation stages. We then test variants in 97 obesity-associated loci using a massively parallel reporter assay and identify putatively causal variants that display cell type specific or cross-tissue enhancer-modulating properties. Integrating these variants with gene regulatory information suggests genes that underlie obesity GWAS associations. We also investigate a complex genomic interval on 16p11.2 where two independent loci exhibit megabase-range, cross-locus chromatin interactions. We demonstrate that variants within these two loci regulate a shared gene set. Together, our data support a model where GWAS loci contain variants that alter enhancer activity across tissues, potentially with temporally restricted effects, to impact the expression of multiple genes. This complex model has broad implications for ongoing efforts to understand GWAS.

HSV plots All Hue-Saturation-Value (HSV) analyses are developed from code originally published in Siersbaek et al21. Value (V) indicates the maximum log2(TPM/cpm) for a given gene at any time point, and so is defined as: V=max(C_t) Saturation (S) indicates the maximal fold change between any time points and is defined as: S=1-(min_t (C_t))/V Hue (H) indicates the pattern of change in gene expression across time, and is defined as: H=60*(2+(C_0+C_2-C_16-V)/(V*S))*(C_2-C_0)/(|C_2-C_0 |) For visualization purposes, the values of V and S were scaled between 0 and 1 based on rank. For the purposes of HSV visualization, RNA-seq gene counts were converted to TPM to normalize for transcript length and then normalized by library size using trimmed mean of M-values (TMM) normalization and normalized by transcript length. Mean TPM was calculated at each time point, and all genes with mean log2(TPM) < 1 at any time point were removed from further analysis. For ATAC-seq data the union set of significant peaks across time points was obtained, and peaks of a uniform length of 1kb were obtained by centering around the summit of the highest peak per peak locus in the union set. The counts per time point mapping to these 1kb union peaks were obtained and transformed to log2cpm format, normalizing by library size(defined as total number of reads in peaks per sample). Hue, saturation, and value were calculated using the same equations as with gene expression, using normalized log2(cpm) values as input. For promoter capture Hi-C data, counts were normalized by library size using (TMM) normalization and transformed into counts-per-million (cpm). Hue, saturation, and value were calculated using the same equations as with gene expression, using normalized cpm values as input.
Pearson's r correlation analysis Pearson correlations were performed on the normalized datasets used for HSV analysis with the cor function in R.
Fuzzy-c means clustering Gene-level read counts were quantified in each technical replicate at each time point directly using salmon(v0.7.2), correcting for sequencespecific bias and using a gene list derived from GENCODE release grch37.v19. Gene-level read counts were transformed into cpm and any gene with cpm < 1 in more than three samples across all time points was removed from further analysis. The data were normalized to account for library size using TMM normalization. Linear models testing pairwise differential expression between any two time points were then build using limma and tested using a moderated t-test accounting for mean-variance dependence and increased dispersion in limma. All genes with significant differential expression between any two time points were included in the clustering analysis. Raw gene-level counts from salmon were normalized to account for transcript length and scaled to account for differences in gene expression across genes. Fuzzy c-means clustering was performed in R using the e1071 package. A gene was assigned to the cluster for which it had the highest membership if 1) its membership score was above 0.3 for the averaged replicates and 2) above 0.2 for each individual replicate. The top three clusters were defined by the highest average membership score.
LDSC Partitioned Heritability Analysis Heritability per chromosome was calculated via LD score regression analysis using the ldsc command line tool (v1.0.0) using Locke et al 2012 BMI GWAS summary statistics and Yengo et al 2018 downloaded from the GIANT consortium. Briefly, .bim files from 1000 Genomes Phase 1 were downloaded and annotation files were created for each chromosome where chromosome was treated as a binary annotation. LD scores were then computed from these annotation files for input into partitioned heritability analysis. Summary statistics were filtered to contain only HapMap3 variants as advised.
Transcription factor motif analysis All regions identified to be significant enhancers were included in this analysis. Regions were expanded to be 175bp (size of enhancers tested in MPRA) and then if two regions overlapped they were then merged so they would not become overrepresented in the analysis. The program findMotifsGenome.pl from HOMER (v4.8.3) was then used in addition to the -size flag to identify motifs that were overrepresented in significant MPRA enhancers from each cell line. These were compared to a size and base composition matched set of background sequences computed by HOMER to determine significance and p value. All p values from each cell line are included in the supplementary tables.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability Field-specific reporting Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size
Calling MPRA EMVAR interactions with promoters MPRA EMVars were considered to interact with a promoter if the distal end of the promoter interaction came within 1kb of the single base pair SNP location. EMVar SNP location and cHi-C BEDPE files were overlapped using the BEDtools (v2.27.1) pairToBed function.

Enrichment analysis:
The promoter-distal ends of interactions are enriched for functional histone mark (H3K27ac and H3K4me1) ChIP-seq peaks and ATAC-seq peaks compared to a distribution of randomly chosen, number-matched set of non-promoter MboI fragments within mappable genomic regions (N=100 iterations). The fold change of the observed overlap over our 100 randomized sets is presented. ChIP-seq datasets were obtained from Adipose Nuclei (E063) and Fetal Brain (E081)  No statistical methodology was used to determine the necessary sample size for the following experiments. However, we decided on replicate number based on previous publications which had success seeing statistical significance for measures of effect similar to what we wished to achieve with these experiments. These replicate numbers were based on the relative variance in the assay as well as the effect size we wished to achieve.
Promoter Capture Hi-C: Two technical replicates for each time point were performed derived from two unique differentiations for both SGBS and hypothalamic neurons. Each technical replicate was analyzed alone, and additionally the technical replicate raw sequencing data was merged and analyzed to produce merged datasets. Merged datasets and individual replicate datasets were used in downstream analyses.
ATAC-seq: Two technical replicates were performed for each time point derived from two unique differentiations for both SGBS and hypothalamic neurons. The two replicate sequencing data was merged and analyzed to produce merged datasets which were used in downstream analyses.

Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. Methods n/a Involved in the study

ChIP-seq
Flow cytometry MRI-based neuroimaging RNA-seq: Three technical replicates were performed derived from three unique differentiations for both SGBS and hypothalamic neurons.
MPRA: MPRA experiments were comprised of two soft biological replicates with 2-3 technical replicates per soft biological replicate. In total GT1-7 cells had 6 replicates, 3T3-L1 cells had 7 replicates, HT22 cells had 5 replicates, SGBS Day 0 cells had 6 replicates, and SGBS Day 8 cells had 5 replicates. A replicate was considered "biological" if the input DNA library was separately cloned from the beginning from our Agilent oligonucleotides.
Luciferase Assays: The luciferase assays used for MPRA validation and allele specific enhancer activities (supplementary figure 3 and Figure 4d) had between 3-12 replicates per construct, where different DNA preps were used and the cells were transfected, collected, and analyzed on different days. This varied based on technical variance within cell types, as different cell types had variable levels of transfection leading to higher variance and thus required more replicates to achieve sufficient power.
Enhancer deletion iPSC differentiation: This experiment had 3-4 homozygous biological replicate deletion clones of each genotype that were collected during four stages of differentiation from the iPSC stage to early neuronal precursor stage.
No data were excluded from the analyses.
To generate reference omics maps we used three (RNA-seq) or two (ATAC-seq, HiC-seq) replicates of each assay at each time point representing two or three independent differentiations. Each replicate was performed successfully.
As described above, all omics experiments (RNA-seq, ATAC-seq, HiC-seq, and MPRA) were carried out and visualized with PCA to verify that cell types, differentiation stages, or perterbations (e.g genome editing) clustered as expected. Key cellular expression markers for each cell type were confirmed, and enhancer predictions were replicated with a variety of orthogonal techniques (histone mark enrichment, luciferase assay, ATAC-seq enrichments, etc) throughout the manuscript.
Enhancer deletion lines were confirmed for homozygous deletions via a combination of PCR genotyping and sanger sequencing. CRISPRi knockdown was confirmed using a positive control set of guides targeting the promoter of GAPDH. MPRA findings were validating using luciferase assays as described above.
All experiments were performed with caution and randomized throughout to reduce potential for batch effects to emerge during analysis. For example, during the RNA-seq analysis, covariates such as RIN score, RNA-extraction batch, read mapping, FACS sorting date and RNA-library prep batch were correlated with all PCs that explained greater than 5% of the variance in the RNA-seq data using a linear model. If the covariate significantly correlated with one of these top PC's it was adjusted for when determining significantly differentially expressed genes.
Covariates that were included in the model are outlined in the methods section.
The experimental work in this manuscript was conducted by Amelia Joslin and/or Debora Sobreira. Data preprocessing and significance calling was conducted by Grace Hansen and/or Noboru Sakabe, who determined data quality and were made blind to experimental groups or perturbations as much as possible.