Sharing information between related diseases using Bayesian joint fine mapping increases accuracy and identifies novel associations in six immune mediated diseases

Thousands of genetic variants have been associated with human disease risk, but linkage disequilibrium (LD) hinders fine-mapping the causal variants. We show that stepwise regression, and, to a lesser extent, stochastic search fine mapping can mis-identify as causal, SNPs which jointly tag distinct causal variants. Frequent sharing of causal variants between immune-mediated diseases (IMD) motivated us to develop a computationally efficient multinomial fine-mapping (MFM) approach that borrows information between diseases in a Bayesian framework. We show that MFM has greater accuracy than single disease analysis when shared causal variants exist, and negligible loss of precision otherwise. Applying MFM to data from six IMD revealed causal variants undetected in individual disease analysis, including in IL2RA where we confirm functional effects of multiple causal variants using allele-specific expression in sorted CD4+ T cells from genotype-selected individuals. MFM has the potential to increase fine-mapping resolution in related diseases enabling the identification of associated cellular and molecular phenotypes.


Introduction
The underlying genetic contribution to many complex diseases and traits has been investigated with great success by genome-wide association studies (GWAS). Various approaches have been developed to identify regions associated with individual diseases, and these have led to the detection of thousands of variants associated with a spectrum of diseases. In particular, much progress has been made in the genetics of immune mediated diseases (IMD), revealing a complex pattern of shared and overlapping genetic etiology 1,2.

Fine mapping, the process of distinguishing causal genetic variants from their neighbours, is an essential step to enable the design of functional assays required to understand the mechanism by which the region impacts disease risk, but it is complicated by linkage disequilibrium (LD) 3. The problem is often approached through stepwise regression 4 which assumes that statistical inference of the best joint model (i.e. a model with multiple causal SNPs) can be derived by starting with the most significant SNP, then conditioning on this and adding the next most significant, continuing this conditioning until no conditionally significant SNPs remain. It has been noted that the SNP with the smallest p value need not be causal, especially if it is in LD with two causal SNPs 5. Alternative Bayesian fine mapping methods have been developed which use a stochastic search instead of stepwise search 6-8. Stepwise and stochastic search results may disagree 8 and although stochastic search generally demonstrates improved accuracy 9 these techniques have not yet been widely adopted.

Here, we systematically compare stepwise and stochastic approaches by application to dense genotype data for six IMD. We aim to address, in particular, the frequency and causes of disagreement between stochastic and stepwise search results. Our results show that stochastic search solutions are more likely to be correct than stepwise search results when sample sizes are

in combination with rs61839660:C and rs11594656:T (CCT, frequency 13%) is indistinguishable from the common (susceptible) haplotype TCT (Fig. 1a, p=0.24).

First, simulations showed that if the J model were true, both stepwise and stochastic search would correctly identify it (Fig. 1b, Supplementary Table 6). In contrast, if the A+C model were true, stepwise got "stuck" on J, while stochastic search moved from selecting J at lower sample sizes, to A+C at higher sample sizes (Fig. 1b, Supplementary Table 7, further examples in other regions/diseases in Supplementary Tables 8-11).
A small perturbation on the simulated effect sizes for A+C led both methods to select C or A+C directly, indicating that the potential for joint tagging was dependent on the combined effect sizes.

Second, we showed mathematically that there was a high probability of J having the smallest p value when A and C were causal only when A and C had similar odds ratios; and that our observed data fell within this region (Fig. 1c). Again, a similar pattern was seen at all other mismatch regions (Supplementary Fig. 2).

and GUESSFM for ATD in region 10p-6030000-6220000. There are four common haplotypes. Three carry the minor allele at the J SNP rs706799, but only those that also carry minor allele at A or C show a significant effect on disease risk. b Comparison of stepwise and stochastic search applied to simulated data. Causal variants were simulated as follows: "J": single causal variant J, OR=0.8; "A<C": causal variants A+C, odds ratios A:0.81, C:0.74; "A>C": causal variants A+C, odds ratios A:0.74, C:0.8. The y axis shows the proportion of simulations in which the stepwise approach chose the indicated model (adding SNPs while p < 10^-6) or the average posterior probabilities for each model for the stochastic search approach. Sample size (x axis) is the number of cases and controls. c Assuming A and C are causal, this plot shows the probability that J has the smallest p value as a function of the effect sizes (log odds ratios) at A and C. The estimated effects for A and C from real data are shown by a point, and the simulations from b by "<" and ">" for A<C and A>C conditions, respectively.

Finally, we showed that the pattern of LD between three SNPs (two causal and a third tag), together with MAF (minor allele frequency) and effect sizes, determines whether a tag SNP has the smallest expected p value (Fig. 2a, Supplementary Note).
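The dependence of joint tagging on LD and effect sizes can be illustrated with a small calculation. The following is a minimal sketch and not the calculation behind Fig. 1c or Fig. 2a: it assumes a linear approximation in which the expected marginal signal at a SNP is proportional to the LD-weighted sum of standardized causal effects. The function names, correlation values and effect sizes are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a linear approximation to the expected marginal
# association signal, using hypothetical LD values and standardized effects.

def expected_marginal_signal(r_with_a, r_with_c, effect_a, effect_c):
    """Quantity proportional to the expected Z score at a SNP whose signed
    correlations with causal SNPs A and C are r_with_a and r_with_c."""
    return r_with_a * effect_a + r_with_c * effect_c


def top_snp(r_ja, r_jc, r_ac, effect_a, effect_c):
    """Return which of A, C or the tag J has the largest absolute signal."""
    signals = {
        "A": expected_marginal_signal(1.0, r_ac, effect_a, effect_c),
        "C": expected_marginal_signal(r_ac, 1.0, effect_a, effect_c),
        "J": expected_marginal_signal(r_ja, r_jc, effect_a, effect_c),
    }
    return max(signals, key=lambda name: abs(signals[name]))


# Tag J correlated (r = 0.6) with two uncorrelated causal SNPs.
similar_effects = top_snp(0.6, 0.6, 0.0, effect_a=-0.2, effect_c=-0.2)
unequal_effects = top_snp(0.6, 0.6, 0.0, effect_a=-0.3, effect_c=-0.05)
```

Under these assumed values the tag J carries the largest expected signal when the two causal effects are similar, while A leads when its effect is much larger than that of C, which mirrors the qualitative pattern described above.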
At the extremes of this pattern, there is a non-zero probability that the tag model will be erroneously selected, even by a criterion such as BIC which penalizes the larger model (Supplementary Note). While we cannot identify how many cases of joint tagging may exist in our GWAS data because the causal variants are unknown, we can quantify what proportion of 3 SNP LD matrices match this pattern under an assumption of equal odds ratios at the causal variants. Doing so, we found that 20-40% of potential common causal variant pairs (MAF>5%) had a potential joint tag, though this was highly variable across regions (Fig 2b-

Together, these results better characterize and quantify the potential frequency of joint tagging, in which a non-causal SNP carried on population haplotypes together with distinct causal SNPs with similar effects may have a smaller single SNP p value than either causal variant itself. This can cause stepwise search to get "stuck" on the tag, whereas stochastic search will find both causal variants, if the sample sizes are large enough. However, with smaller sample sizes, stochastic search may also choose the tag, because small sample sizes may not contain enough information to overcome the strong penalty that needs to be applied to more complex models to avoid over-fitting. Thus, joint tagging may potentially affect many more cases than the simple comparison of stepwise and stochastic search results based on fixed sample sizes above identifies.

We noticed a striking overlap between the fine mapping results for different diseases in these regions, with 20 of 30 regions with two or more associated diseases showing evidence of overlap (Supplementary Fig. 3), consistent with previous reports of shared genetic etiology between the diseases 2 which inspired the creation of the ImmunoChip. This motivated us to exploit the sharing between diseases, by extending the stochastic search approach to jointly analyse multiple diseases, borrowing information between them, to help overcome these sample size limitations.

Proposed method for multinomial fine mapping (MFM) of multiple diseases
We use a multinomial logistic regression framework which is the natural extension of the binomial logistic model, in which each individual is assumed to belong to exactly one disease group or a pooled group of controls shared between diseases. This formally accounts for the sharing of controls between diseases in different studies.

We introduce the concept of "configurations" (sets of causal variant models for each disease), and we borrow information between the diseases by means of a prior which upweights configurations that share one or more causal variants between diseases by a factor κ (Fig. 3). Such a parameter is also used in colocalization analysis, with values ranging from 100 1,17 to 1000 18

One obvious challenge for dealing with configurations is that the number of models that needs to be considered for each disease is already large, and the number of possible configurations is the product of these. This implies that exponentially increased computational time and memory will be required to evaluate all configurations, and to store these results. We provide solutions for both challenges. First, we show the log Bayes factor for a multinomial model that simultaneously considers all diseases can be approximated by a quantity that can be rapidly calculated: the sum

We examined the performance of MFM by simulation. We found that when causal variants overlapped between diseases, MFM was able to recover the correct models at smaller sample sizes than individual disease analysis (Fig. 4a-

We applied MFM to all 30 ImmunoChip regions with at least two associated diseases (complete results in Extended Information and Supplementary Table 16). We identified seven regions for which the top model by single disease stochastic search and MFM differed (Table 2). Four of these were single SNP models under single disease analysis which moved to an alternative single SNP in joint analysis. For three of these four, the difference was seen in analysis of a UK-only subset, so that we could consider single-disease analysis of the UK+international data which included more samples but used the more conventional analysis method as an "adjudicator". In all three cases, this adjudicator matched the MFM analysis of the UK-only data, suggesting that UK single disease analysis was limited by power, and that UK MFM analysis increased power, allowing conclusions to be drawn that were consistent with those seen in a larger single disease analysis.

One of the multi-SNP regions which showed differences across multiple diseases was on chromosome 2q, harbouring the candidate gene CTLA4. In stepwise analysis, iRA, T1D, ATD, and CEL all converge on a single SNP model, in the group labelled G in the stochastic search results, while for iCEL a single SNP is selected in group I (Table 3, Fig. 5a). For single disease stochastic search, we find CEL (UK only) and ATD have a single signal in the group labelled G, matching the stepwise results, while RA and T1D both have 2 signals, in groups labelled E and H. The iCEL result is more uncertain, with the posterior spread between I+K, I or E+G. Note that K is also the second selected SNP for iCEL stepwise regression (p=4x10^-6), though it doesn't reach our adopted significance threshold. Simulations show that G may tag an E+H model (Fig. 5b-c, Supplementary Tables 8-9).

MFM finds increased support for E+H for RA and T1D while the CEL and iCEL results become more concentrated with support for G or E+G (Table 3). While we suggested G may tag E+H, MFM maintains strongest support for G in ATD, although there is also posterior support for H in combination with other groups (group marginal posterior probability of inclusion, gMPPI=0.60). A previous attempt to fine map autoimmune disease association, by colocalization analysis of T1D, RA and CEL (using the same UK data as here) came to similar conclusions, finding strong support for E+H models for iRA and T1D and either G or E+G for CEL 1. However, a more recent analysis of T1D and RA, also in largely the same samples, identified a different pair of variants, rs3087243 (G) and rs117701653 (C) 19 for both diseases using an exhaustive search of all one and two SNP models.

We compared the models suggested by all these studies across all diseases by BIC (Supplementary Table 17) and using haplotype analysis (Fig. 5d). This visually highlighted rs117701653/C identified for iRA by exhaustive search 19 and rs76676160/K identified by stochastic search for CEL and iCEL as having similar protective effects across all diseases and low minor allele frequencies (<0.05). The two SNPs are unlinked (r² < 0.01) and in low LD with other genotyped or imputed SNPs outside their groups (r² < 0.2). The 2-SNP models E+H identified here, and G+C 19 have similar BIC in our data for iRA and iCEL (Supplementary Table 17), but the greater number of SNPs in the E and H groups means that E+H encompasses many more possible causal variant pairs and so has greater grouped posterior support. Additionally, individual E+H models have a clearly better fit than G+C for T1D (Supplementary Table 17

Comparison of stepwise and stochastic search applied to simulated data. Causal variants were simulated as follows: "G": single causal variant G, OR=0.

Our previous report of stochastic-stepwise mismatch focused on MS and T1D in the IL2RA region 8. We identified four groups of SNPs corresponding to four causal variants for T1D, with results agreeing between stepwise and stochastic search 8

effects for A and D from MS data are shown by a point, and the simulations from c by "<" and ">" for A<D and A>D conditions respectively.

In memory CD4+ T cells, A-het and A+D-het individuals showed an allelic imbalance with the MS protective A haplotype producing more IL2RA mRNA, inconsistent with B causing the imbalanced expression since A+D-het individuals tested are homozygous for B (Fig. 7b). Also inconsistent with B causality is the lack of allelic imbalance in memory T cells from D-het individuals who are heterozygous at B. In naive CD4+ T cells, D-het as well as A+D-het heterozygotes had an allelic imbalance with the protective D haplotype producing less IL2RA mRNA than the susceptible or protective A haplotypes, confirming our previous observations of decreased CD25+ naive CD4+ T cells associated with donors having the protective D haplotype 8. Again, this is inconsistent with B causality, since only D-het and not A+D-het individuals are heterozygous at B. In A-het donors there appears to be an allelic imbalance in naive CD4+ T cells favouring the A versus susceptible haplotype, which is the opposite direction to that observed with protection at D and could reflect an anticipatory differentiation of naive T cells toward the memory lineage and its phenotype of increased CD25 expression in A haplotype donors. However, it is not significant, and we did not observe an increase in CD25+ naive T cells associated with the A haplotype in a previous study 23.

Additionally, we identified four individuals, three of whom carry rare IL2RA haplotypes (Fig. 7c): donor 1 carries a common haplotype combination that is homozygous across A, B, D; donor 2 carries the minor allele at B in the absence of a minor allele at either A or D, donor 3 carries a minor allele at D but not B, and donor 4 also carries a minor allele at D but not at B on one haplotype and minor alleles at A and B on the other haplotype (Fig. 7a)

Fine mapping is a general problem in statistical genetics, important in its own right and for informing integrative downstream analyses 18,28. We have shown that there are candidate causal SNP models for which stepwise regression does not converge to the correct solution, even as the sample size grows, and described the constraints on LD that give rise to this joint tagging phenomenon. In contrast, stochastic searches do tend to the correct solution as sample sizes increase, and we propose they should be more widely adopted by those interested in fine mapping GWAS results. However, even stochastic search methods are limited by existing sample sizes when there are multiple causal variants in proximity, and may produce similar results to stepwise methods when sample sizes are insufficient.

Our new method MFM borrows information across diseases and is thus related to, but distinct from, methods which aim to assess whether two diseases share causal variant(s) in a region 18,29 or which fine map those variants conditional on evidence for shared causal variants 1. We avoid enforcing identity of causal SNPs or their effect sizes between different diseases, as in analysis of an overarching disease phenotype (e.g. "autoimmune disease" 19). It is clear from our results that causal variants may differ between diseases in the same region and that, even when causal variants are shared, effect sizes and even direction of effects may differ between diseases. While we use individual level genotype data from IMD studies, the method could be adapted to summary GWAS data with Bayes factors calculated using summary data 7 or applied to other collections of diseases where causal variants may tend to be shared, such as psychiatric diseases 30 or metabolic-related traits 31.

One key result from our analysis is that sample sizes in the low tens of thousands may still not be large enough to robustly fine map multiple causal variants. This motivates continued collection of GWAS samples for diseases too infrequent to be found in large numbers in the Biobank style datasets, and greater sharing of data between researchers working on related diseases to better map the most likely genetic causal variants. A particular note of caution is raised by the genomic locations where we find discrepancies between stochastic and stepwise results. These are almost entirely those with the strongest biological prior for involvement in these diseases, and also those with typically the strongest effects, and thus greatest power. We question whether these regions are most likely to give rise to discrepancies because they harbour the largest numbers of potential effects or whether, if we had access to much larger datasets, we would see similar discrepancies

for eczema and IBD. Alternatively, the genetically-determined level of CD25 on memory CD4 T cells could influence their likelihood of differentiating into particular types of cytokine-producing effector cells, a phenotype beneficial for some diseases but not others. We propose that, rather than attempting to colocalize eQTL signals and disease associations that are both determined by stepwise analysis 32, disease haplotype-directed searches for allele-specific expression exemplified in this study will lead to greater clarity when unraveling cellular mechanisms in immune-based diseases.

OR relating the odds of disease in heterozygote carriers of the non-reference allele compared to the homozygote reference allele. We assumed a multiplicative model throughout.

Simulations -single trait
The SNPs belonging to the above-mentioned groups, as well as the lead SNPs for autoimmune thyroid disease (ATD; rs706799), alopecia areata (AA; rs3118470), rheumatoid arthritis (RA; rs10795791), and ulcerative colitis (UC; rs4147359) were extracted from the generated data for analyses via stepwise regression and stochastic search; the lead SNP for multiple sclerosis forms group B. For each replication a stepwise regression model was fit, adding SNPs to the model using a p-value threshold of 1×10^-6. To generate stochastic search results, we used GUESSFM 8, setting a prior of 3 causal variants for the region to encourage good mixing of the chains in the initial Bayesian variable selection, and setting the prior to a more conservative 2 causal variants per region to obtain final model posterior probabilities (PP). Model fits were summarized by the proportion of times each model was selected via stepwise regression or the mean of the GUESSFM posterior probabilities for each model.

Simulations -multiple traits
We adapted the HapGen2 simulation outlined above to simulate datasets for two case and one control set; code is available in https://github.com/jennasimit/MFMextra. First we used HapGen2 to generate a population of 100,000 individuals based on the CEU 1000 Genomes Phase 3 data.
Causal variants for each trait were randomly selected within particular SNP groups for a certain disease model (see Supplementary Table 18); when the same SNP group contained a causal variant for both diseases, one variant was selected from the group and set as causal for both diseases. Logistic regression models with the selected causal variants and odds ratios (OR) were then used to assign each individual as either a member of the controls, disease 1 cases, or disease 2 cases until the desired number of individuals in each group was attained; let ORjk be the odds ratio for causal variant j and disease k. The prevalence for both diseases was set to 0.1, as our purpose is to generate cases and controls for method comparison. In particular, the following steps were used to ascertain control/disease 1/disease 2 status, where xij is the number of non-reference alleles of variant i for individual j (i.e. genotype score), gj is the vector of genotype scores for individual j, β0 = log(0.1), and βik = log(ORik) is the effect of causal variant i for disease k.

We simulated either shared configurations where each disease was under the influence of two causal variants, one shared between diseases (A) and one unique to each disease (one from C, one from D); or independent configurations, where the two diseases were under the influence of distinct causal variants (one from each of A and D for one disease and one from C for the other disease) or one disease had no associations in the region (one from each of A and D for one disease and none for the other disease). All causal variants were assigned an odds ratio of 1.25 or 1.4. For both diseases, equal-sized case-control samples consisting of N cases and N controls were considered for N ranging from 1000 to 5000; each simulation setting had 100 replications.
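The assignment of control/disease 1/disease 2 status described above could be sketched as follows. This is a minimal illustration and not the authors' procedure, whose individual steps are not reproduced here: it assumes a multinomial logistic form with controls as the reference category, an intercept of log(0.1) for each disease, and log odds ratios as effects. The function names, the dictionary layout and the example configuration are hypothetical.

```python
import math
import random


def status_probabilities(genotype, log_or, baseline=math.log(0.1)):
    """Return [P(control), P(disease 1), P(disease 2)] for one individual.

    genotype: dict mapping variant name -> non-reference allele count (0, 1, 2)
    log_or:   list with one dict per disease, variant name -> log odds ratio
    Assumption: multinomial logistic model with controls as the reference, so
    a non-carrier has probability 0.1 / 1.2 (not exactly 0.1) of each disease.
    """
    weights = [
        math.exp(baseline + sum(beta * genotype[v] for v, beta in effects.items()))
        for effects in log_or
    ]
    total = 1.0 + sum(weights)  # the control category contributes exp(0) = 1
    return [1.0 / total] + [w / total for w in weights]


def draw_status(genotype, log_or, rng):
    """Draw 0 (control), 1 (disease 1 case) or 2 (disease 2 case)."""
    probs = status_probabilities(genotype, log_or)
    return rng.choices([0, 1, 2], weights=probs)[0]


# Hypothetical shared configuration: A is causal for both diseases,
# C only for disease 1 and D only for disease 2, all with OR = 1.25.
shared_config = [
    {"A": math.log(1.25), "C": math.log(1.25)},
    {"A": math.log(1.25), "D": math.log(1.25)},
]
rng = random.Random(1)
status = draw_status({"A": 1, "C": 0, "D": 2}, shared_config, rng)
```

In a full simulation, individuals would be drawn from the simulated population and assigned a status in this way until the desired number in each group was reached.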
We compared the independent stochastic search analyses of each disease with the multinomial approach with upweighted sharing based on a range of target odds (i.e. prior odds of no sharing of causal variants between one disease and any other disease). We focused on a target odds (TO) of 1, such that there is an equal probability of sharing to non-sharing. Results for a range of TO from 9 (no sharing more likely than sharing of causal variants) to 0.35 (sharing more likely than distinct causal variants) are in Supplementary Tables 12-15.
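To make the target odds scale concrete, the sketch below converts prior odds of no sharing into the corresponding prior probability. It is an illustration only, and the function names are hypothetical. The second function assumes that the rule given under the fine mapping analyses for more than two diseases reads as 0.5 raised to the power sqrt(d - 1); that reading is an assumption on our part.

```python
import math


def prob_no_sharing(target_odds):
    """Convert prior odds of no sharing into a prior probability of no sharing."""
    if target_odds < 0:
        raise ValueError("odds must be non-negative")
    return target_odds / (1.0 + target_odds)


def prob_no_sharing_many(d):
    """Prior probability that d diseases share no causal variants, assuming the
    rule 0.5 ** sqrt(d - 1). For d = 2 this reduces to 0.5."""
    if d < 2:
        raise ValueError("need at least two diseases")
    return 0.5 ** math.sqrt(d - 1)


p_equal = prob_no_sharing(1.0)       # TO = 1: sharing and no sharing equally likely
p_distinct = prob_no_sharing(9.0)    # TO = 9: no sharing favoured
p_shared = prob_no_sharing(0.35)     # TO = 0.35: sharing favoured
```

Under this conversion TO = 1 corresponds to a prior probability of 0.5 of no sharing, TO = 9 to 0.9, and TO = 0.35 to roughly 0.26.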

Mathematical predictions of SNP with minimum univariate p value
We used "sunbeam plots" to characterize how changing the odds ratio of two causal SNPs in a model can change the probability that a third variant will have the minimum p-value (and hence be selected first in any stepwise fine mapping algorithm). We utilized components of the simGWAS package (http://github.com/chr1swallace/simGWAS) to calculate expected GWAS Z scores for any given set of causal variants and their effect sizes, across those causal variants and their neighbouring SNPs 35. We considered the behaviour of Z scores at each of two nominated "causal" variants (following Fig. 1, let us refer to these variants as A and C) with a third SNP, not itself causal, but potentially correlated with both A and C (in Fig. 1, this is SNP J). For each of a range of possible odds ratios, we computed which of the three SNPs had the smallest expected p-value, and coloured that square of the grid correspondingly. When the log odds ratios of both A and C were close to 0, then no SNP had a low p-value and it was not possible to find significant evidence of disease association in the region. This section of the grid was coloured white. Superimposed upon the grid is a point corresponding to the odds ratio we computed for A and C from the real dataset. Code to produce these plots is at https://github.com/chr1swallace/MFM-paper/tree/master/sunbeams.

Fine mapping analyses of ImmunoChip-genotyped diseases
We collated individual genotype data generated using the ImmunoChip for a total of 61,641 individuals, formed of controls and six disease cohorts: MS (UK subset) 11, T1D 10, juvenile idiopathic arthritis (JIA, UK subset) 14, celiac disease 13, rheumatoid arthritis (RA) 15 and autoimmune thyroid disease (ATD) 12 (Supplementary Table 1). All genome coordinates are from build GRCh37.
To ensure controls could be combined across datasets, we restricted analysis for the multinomial model to UK samples, and used principal component analysis including 1000 Genomes data to exclude 2 individuals who fell outside individual country clusters. Genotypes were compared between datasets to ensure exclusion of duplicate samples. Data were split into subsets according to the densely genotyped regions targeted by the ImmunoChip (Supplementary Table 2) and imputed to 1000 Genomes phase 3 33 using SHAPEIT 36 and IMPUTE2 37. Phased reference data was downloaded from https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html. Country and the first 4 principal components were included as covariates in all regressions to account for population structure. SNPs were excluded if they had info scores < 0.3, certainty < 0.98, |Z| for HWE > 4 in UK controls, MAF < 0.5% in UK controls, call rate < 0.99 in any case or control group, or an absolute difference in "certain genotype" call rates between controls and any case group of > 5%.

Forward stepwise regression was performed using univariate logistic regressions across all SNPs in the region. The SNP with the strongest association (smallest p value) was selected, then all two SNP models containing the selected SNP and any other SNP were considered, and the process repeated until no SNP could be added with a marginal p < 10^-6.

Stochastic search fine mapping of single diseases was performed using GUESSFM (http://github.com/chr1swallace/GUESSFM). Initial searches were performed after tagging at r² < 0.99 with an optimistic binomial prior for the number of causal variants per region with expectation set at 3 to allow good mixing of the chains. Reanalysis of the expanded tag sets for SNPs in models included in the model set with total posterior probability 0.99 was performed using approximate Bayes factors and the more conservative prior expectation of 2 causal variants per region using GUESSFM. GUESSFM results were combined using the methods proposed in this paper, as implemented in the R package MFM (http://github.com/jennasimit/MFM). We set the prior odds that two diseases shared any causal variants to 1 (i.e. a 50% probability that they share none). For a number of diseases, d > 2, we set the prior that the diseases share no causal variants to 0.5^√(d-1), where the exponent is the geometric mean of the exponents in the (nonsensical) extremes 0.5^(d-1), which assumes all diseases are independent, and 0.5, which assumes all diseases are completely dependent.

Code to perform these steps is available at https://github.com/chr1swallace/MFM-analysis.

SNP grouping
SNPs with marginal posterior probability of inclusion > 0.001 were grouped according to criteria of substitutability: that one SNP could substitute for another in all models. We reasoned that this meant SNPs would need to be in LD (high r²) and rarely selected together in models, i.e. model selection correlation (r) should be negative. We hierarchically cluster SNPs within each disease according to r² × sign(r) using complete linkage, and group SNPs by cutting the tree such that all SNPs within a group must have pairwise r² > 0.5, pairwise r < 0, and a marginal posterior probability that both are included in a model of < 0.01. We then identify overlapping groups defined in different diseases, and merge or split groups when they meet these criteria. The specific algorithm is
The specific algorithm is 622 naive T cells were sorted as CD3 + CD4 + CD8 -CD127 med/high CD25 low-med CD45RA + and CD27 + , 644 whereas CD4 + central memory T cells were sorted as CD3 + CD4 + CD8 -CD127 med/high CD25 low-med 645 CD45RAand CD27 + . 646 To phase the direction of effect from the four donors carrying rare IL2RA haplotypes (Fig. 7a, 7c), 647 their haplotypes were compared to those found in the 1000 Genome Project CEU data to assess 648 the allele frequency of the ASE readout SNP (rs12244380, A or G), to predict which allele is most 649 likely to be carried. developed the MFM method, performed statistical analyses and interpreted results, wrote the 762 paper. All authors read and agreed the manuscript. 763 Tables   Table 1: Regions that have conflicting models selected by stepwise and stochastic search for at least one autoimmune disease. Each row summarizes results for a single region, defined by chromosome, start and end coordinates (hg19), with neighbouring or previously reported candidate gene names shown for orientation. Each stepwise search model consists of a single SNP and we also indicate which SNP group it belongs to, by a letter in front of the SNP rs ID; the SNP group size, p-value of the SNP, and stochastic search group posterior probability (GPP) are also given. Analogous information is given for stochastic search models and for 2-SNP models the joint p-values from these model are given. The LD column lists the r 2 between the stepwise SNP and the SNP(s) from the stochastic search model.       MFM was run at a range of target odds (TO; prior odds of no sharing of causal variants between one disease and any other disease) values to illustrate the impact of TO and with decreasing TO there is an increasing prior weight for sharing of variants; TO=null indicates no sharing and independent stochastic search analyses were run and TO=1 was the setting used in our MFM analyses.   
international RA (iRA) analyses are included, but not RA (UK only) since it did not meet our finemapping criteria.

Supplementary
Supplementary Table 19: Allele-specific expression (ASE) analysis results. The genotype of a SNP from each of the IL2RA SNP groups defined in Supplementary Table 18 is listed for each participant that ASE was performed on. The genotypes are phased so that all the SNPs listed for allele 1 are on the same chromosome, which gives directionality for the ASE readout SNP rs12244380, located in the 3'UTR of IL2RA. ASE was measured using targeted NGS and the counts from each allele of rs12244380 are provided, with 3-4 technical replicates performed. The average of the technical replicates was used to calculate the allelic ratio. Some samples were tested multiple times and these are highlighted in green. The allelic ratios for the central memory CD4+ T cells and naive CD4+ T cells are calculated as the ratio of A to G alleles at the readout SNP, and then re-ordered based on phased haplotypes to match the direction shown as top:bottom in the cartoon haplotypes depicted in Fig. 7a. Not all samples were tested with both naive and central memory CD4+ T cells due to cell number availability. The genomic DNA samples are included as a control showing there is no bias regardless of genotype and all are reported as the ratio of the A:G allele of rs12244380.