Integration of Alzheimer’s disease genetics and myeloid genomics identifies disease risk regulatory elements and genes

Genome-wide association studies (GWAS) have identified more than 40 loci associated with Alzheimer’s disease (AD), but the causal variants, regulatory elements, genes and pathways remain largely unknown, impeding a mechanistic understanding of AD pathogenesis. Previously, we showed that AD risk alleles are enriched in myeloid-specific epigenomic annotations. Here, we show that they are specifically enriched in active enhancers of monocytes, macrophages and microglia. We integrated AD GWAS with myeloid epigenomic and transcriptomic datasets using analytical approaches to link myeloid enhancer activity to target gene expression regulation and AD risk modification. We identify AD risk enhancers and nominate candidate causal genes among their likely targets (including AP4E1, AP4M1, APBB3, BIN1, MS4A4A, MS4A6A, PILRA, RABEP1, SPI1, TP53INP1, and ZYX) in twenty loci. Fine-mapping of these enhancers nominates candidate functional variants that likely modify AD risk by regulating gene expression in myeloid cells. In the MS4A locus we identified a single candidate functional variant and validated it in human induced pluripotent stem cell (hiPSC)-derived microglia and brain. Taken together, this study integrates AD GWAS with multiple myeloid genomic datasets to investigate the mechanisms of AD risk alleles and nominates candidate functional variants, regulatory elements and genes that likely modulate disease susceptibility.

D � For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings � D For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes � D Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated Our web collection on statistics for biologists contains articles on many of the points above.

Software and code
Policy information about availability of computer code Data collection Data analysis NA CHIP-seq and ATAC-seq data analysis To generate the epigenomic annotations FASTQ files were obtained from Sequence Read Archive (SRA). Technical replicates were merged and Bowtie2(v2.4.1) was used for alignment for both single and paired-end files. FASTQC (v0.11.8) was used for quality control of the files. Resulting SAM files were filtered by MAPQ score and duplicates were removed using samtools(v1.9). MACS2 (v2.1.0) was used to call peaks for ATACseq and Ch IP-seq files. Samtools mpileup function was used to quantify the number of reads that align to each allele.
Stratification into promoter/enhancer regions, stratification of ATAC-seq regions and partitioning heritability We used HOMER(v4.11) annotatePeaks.pl function to obtain the distance of the peaks from transcription start sites. We used bed map(v2.4.36) to find overlapping regions. HOMER findMotifsGenome.pl function was used to identify motifs that are enriched in ATAC-seq regions associated with active enhancer profiles. HOMER annotatePeaks.pl and findMotifsGenome.pl functions were used to identify ATAC-seq regions that are positive for the motif of interest. Position weight matrices from HOMER were used. We used LD Score regression (vl.0.0) to quantify to quantify the enrichment of AD risk alleles in functional annotations.
Colocalization and Summary data-based Mendelian Randomization {SMR) analyses Colocalization analyses were performed using coloc.abf function in coloc package (v3.2.1) in R (v3.5.3). To identify putative causal associations between epigenetic activity at active enhancers, gene expression and disease risk, we used SMR with default parameters. To identify the associations between chromatin QTLs (hQTLs) and gene expression, we set hQTLs as exposure and gene expression as outcome.

GCTA-Cojo conditoinal analyses
We used GCTA-COJO(v1.91.7beta) to conduct conditional analyses using IGAP GWAS summary statistics data and ADGC individual-level genotype data as a reference panel. Allele frequencies were estimated with ADGC using plink (v1.9). To conduct the conditional analysis we ran COJO with default parameters.
Prioritization of candidate causal variants For each locus we constructed LD blocks using the CLQ algorithm for block partitioning within the BigLD package (no version listed on github). We then filtered for variants that reside in active enhancers in monocytes, macrophages and/or microglia (with the exception of the SPPL2A locus, since these variants likely regulate a distal enhancer as reported in Figure 3c). We also conducted a motif disruption/ creation analysis on these variants and selected the variants that are predicted to strongly disrupt or create binding sites of transcription factors that are expressed in myeloid cells (TPM;,>l). We then screened the remaining variants for eQTLs in monocytes and macrophages from the Cardiogenics and Fairfax studies. We also used PAINTOR (v3.1) to conduct Bayesian fine-mapping in MS4A, ZYX and BINl loci. PAINTOR is a Bayesian fine-mapping method that leverages functional annotations through an Empirical Bayes prior. The input files for PAINTOR_v3.l were prepared as described on the PAINTOR website and ADGC GWAS summary statistics along with individual-level genotype data were used for fine-mapping.The reprocessed epigenomic annotations were used to quantify enrichment at each locus. Quantification of annotation enrichment and fine-mapping were performed as described in the companion website.To classify the annotations as enriched or not, we computed the relative probability for a SNP to be causal given that it resides in the annotation as described in the companion website. We deemed the annotation to be significant if the relative probability of a SNP to be causal given that it is in the annotation was greater than l. Once candidate causal variants were selected through both approaches, we conducted conditional analyses to make sure that they do indeed tag the majority of the GWAS signal in the locus.

Motif disruption analyses
To obtain evidence of SNPs that break the motif of a given transcription factor, we used motifbreakr, which is available as an R package.
We used HOCOMOCO to screen for motifs and a P-value of significance of 5*10-5 as previously advised by the authors of the package.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size
Sample sizes for: the GWAS (n= 74,046), hQTL(n=172) and eQTL(STARNET n=470, Fairfax n=432,Cardiogenics(mono/macro)n=849/684). These sample sizes were sufficient to detect genome-wide associations signals in AD risk loci and were hence sufficient for our analyses. We generated 3 iPSC-derived microglia lines for variant validation and 4 lines for RNA-seq analysis.
Data exclusions No data were excluded.

Randomization
The associations identified with the monocyte eQTL dataset from the Cardiogenics study were replicated in an independent monocyte eQTL dataset from the Fairfax study. Similarly, two independent macrophages eQTL datasets were used to replicate the findings, Cardiogenics and STARNET. Hence, we ran our analyses once in each monocyte and macrophage datasets, successfully replicating a large portion of AD risk genes within the same cell types.
Because of the nature of the analyses we utilized (colocalization, SMR), we did not require random sampling, hence, randomization is not applicable.

I I
The following studies obtained from GEO were used for the analyses presented in this paper:GSE29611, GSE85245, GSE100380, GSE66594. Data generated in this study are available through accession number GSE164315. DbGAP accession study number for the human microglia dataset is phs001373.vl.pl. The genotype and phenotype data from ADGC are available under phs000372.vl.pl dbGAP study accession number. IGAP data can be found here:http://web.pasteur-lille.fr/en/recherche/u744/igap/igap_download.php. Blueprint eQTL data can be found here:https://www.blueprint-epigenome.eu/. The Cardiogenics dataset can be requested on EGA using accession number EGAS00001000411. DbGAP accession study number for the STARNET eQTL dataset is phs001203.v1.p1. Summary statistics for Fairfax eQTL data can be obtained from ArrayExpress using accession number E-MTAB-2232.

Blinding
The only experiments where we tested the difference between groups were experiments where we tested the effect of the MS4A variants on open chromatin and expression. In these experiments, we did not have the knowledge of the genotype prior to testing. In ICC experiments, groups were not compared against each other , but rather microglial marker expression was confirmed.

Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. Sendai virus was used for reprogramming. The lines had normal and stable karyotypes. Pluripotency was assessed for OCT4, NANOG, TRAl-60, and TRAl-80.

Materials & experimental systems Methods
The cell lines were determined to be negative for micoplasma.
No commonly misidentified cell lines were used in this study. (See ICLAC register) I