The flashfm approach for fine-mapping multiple quantitative traits

Joint fine-mapping that leverages information between quantitative traits could improve accuracy and resolution over single-trait fine-mapping. Using summary statistics, flashfm (flexible and shared information fine-mapping) fine-maps signals for multiple traits, allowing for missing trait measurements and use of related individuals. In a Bayesian framework, prior model probabilities are formulated to favour model combinations that share causal variants to capitalise on information between traits. Simulation studies demonstrate that both approaches produce broadly equivalent results when traits have no shared causal variants. When traits share at least one causal variant, flashfm reduces the number of potential causal variants by 30% compared with single-trait fine-mapping. In a Ugandan cohort with 33 cardiometabolic traits, flashfm gave a 20% reduction in the total number of potential causal variants from single-trait fine-mapping. Here we show flashfm is computationally efficient and can easily be deployed across publicly available summary statistics for signals in up to six traits.


The multiple traits joint ABF is a function of marginal ABFs
We first suppose that we observe N individuals, each with measurements for M quantitative traits that are transformed to meet conditional normality and homogeneity assumptions, conditional on covariates.
Later, we relax this so that a subset of individuals may have missing measurements for some of the traits.
Here, we find expressions for the ABF of causal SNP models for joint and marginal models and show that the information from single trait analyses could be used to evaluate the joint ABF.
To find expressions of the log(ABF) for each of the joint and marginal models we use the approximation based on the Bayesian information criterion (BIC) from the null and causal models (BIC 0 and BIC 1 , respectively) [6]. The log(ABF) approximation (BIC 0 − BIC 1 )/2, is expressed in terms of log likelihoods as log(ABF) where k is the number of causal SNPs in the model and l 1 and l 0 are the log likelihoods of the causal and null models, evaluated at the maximum likelihood estimates.
An expression for the log(ABF) of a causal SNP model for a single trait is found after finding the log likelihoods for the relevant models in a Gaussian framework. Let y j , j = 1, . . . , M denote the vector of N measurements for trait j, γ j represent a particular model with k j SNPs for trait j and X γ j be a N × k j matrix of genotypes scores for k j SNPs that are present in the model γ j for trait j. Under model γ j with causal SNPs X γ j , the log-likelihood of a single trait y j is given by is the MLE variance of the residuals from the fitted model.
Likewise, under the null model of no SNP associations and, without loss of generality, assuming mean 0 for the trait, the log likelihood is Then, using (2) and (3) in (1), the log(ABF) for a single trait j is where V j is the variance of trait j andV γ j is the residual variance from model γ j .
Next, consider M traits that each have a possible model with possibly overlapping causal SNPs X γ j for trait j. Let Y be the N × M matrix of phenotypes and denote its rows by y i· (M-vector of trait values for individual i) and columns by y j (N-vector of trait j values). Under the null model for all traits, the joint log likelihood for M traits is i· is the MLE of the covariance matrix under the null model, having element (i, j) given by 1 N y T i y j (assuming mean 0 for all traits), and NM 2 is obtained by using properties of a scalar and the trace of a matrix.
Likewise, under model γ j with k j causal SNPs X γ j for trait j the joint log likelihood is whereΣ 1 is the MLE of the covariance matrix under this model (covariance matrix of residuals) with element (i, j) given by 1 . Then it follows from (1) with (5) and (6) that the log(ABF) of the joint model containing models γ j for trait j is where K = ∑ M i=1 k i is the total number of SNP effects in the joint model. When there are two traits is the residual covariance for traits i and j, and C 12 is the sample (unbiased) covariance between traits 1 and 2. Using (4), we obtain an expression for g i that is a function of log(ABF), sample size N, and the number of SNPs in model γ j , k j Then, using (4), the sum of the log(ABF) for M single traits simplifies to and it follows that the difference between the joint log(ABF) and the sum of the marginals is where the second determinant is a constant C with respect to the samples. Thus, the joint ABF is proportional to the product of the the marginal BFs and a function of the sample sizes and residual variances and covariances. Residual variances are approximated from the log(BF) for the coinciding model and trait, and residual covariances h i j are approximated as described below. If traits are standardised to have mean 0 and variance 1 and trait summary statistics are unavailable, we may use an estimate of the correlation matrix based on the GWAS summary statistics from the LD score regression approach [3].
with the SNPs contained in models γ i and/or γ j (i.e. union of model SNPs) andβ * i has the same effect estimates asβ i at the SNPs in γ i and has 0 at SNPs that are only in in γ j . The term y T i y j may be estimated from trait summary statistics. Terms of the form y T i X γ jβ j = ∑ k∈γ j S x j y iβ jk , where S x j y i = x T j y i andβ jk is the trait j effect estimate for the kth SNP in γ j , and S x j y i = ∑ N k=1 x jk y ik is calculated from the the single-SNP effect estimates of the kth SNP from the trait j model for trait i (i.e. where RAF x j is the reference allele frequency of x j ). The last termβ T i X T Xβ j relies on the effect estimates from the two trait models and X T X may be approximated from either the genotype or a suitable reference panel, as element (i, j) of the matrix is Thus, all quantities in the final expression for D M could be obtained from marginal analyses of the traits and summary information of the traits.

Traits not measured for all samples
It is common for only a subset of individuals to have measurements for all traits. For simplicity, consider two traits that are both measured in N individuals and let N i− j be the number of individuals with trait i measured, but not trait j; the number of individuals with trait i measured is N i = N + N i− j . Using all N i samples for trait i we obtain the marginal log(ABF i ) as in (4), we then have The joint ABF is obtained in a similar manner, where extra terms are needed to account for the individuals with measurements for only one of the two traits. Let y i− j denote the trait i measurements for individuals that do not have trait j measured. It follows that for the likelihood under the null, we and under the models γ 1 , γ 2 , for traits 1 and 2, whereβ i is based on all N i samples with measurements andβ i− j is based on the N i− j samples with trait i measured and not trait j , we have From (11) and (12) we get Notice that log is the same form as the log ratio in the marginal log(ABF 1 ) expression in (4) and is based on a subset of size N 1−2 from the N 1 samples. Treating this log ratio based on the N 1−2 samples as an approximation to that based on all N 1 samples, we have Likewise for trait 2.
where N j , j = 1, 2 are used, being the number of samples in the estimates forβ j . So, Then, when finding the difference between the joint ABF and sum of marginal ABFs in this setting, the additional terms accounting for missing measurements for one of the traits cancel out between the joint and sum of marginal ABFs, giving a similar form to when all traits are measured.
In general, for M traits, the joint log(ABF) is expressed as where D M for two traits is as in (16) and for M = 3, 4, 5, expressions follow.
When there are more than two traits and some have missing data, additional terms to account for missing measurements are present in the expression for the log(ABF M ). The derivations for 3-6 traits generalise from the two trait scenario and we use the notation δ i jk to represent the term given in (9) for traits i, j, k and analogously for a larger number of traits. In addition, N i jk is the number of individuals with traits i, j and k all measured, N i j−k denotes the number of individuals with both traits i, j measured and not trait k, N i− jk denotes the number of individuals with trait i measured, but neither of traits j and k, and analogous notation is used for more/different combinations of traits.
3 traits: In our flashfm software we include a "fastapprox" option that gives a quicker calculation by ignoring the extra adjustment terms. This is recommended when there are not many missing trait measurements and when a quicker answer is required; by default fastapprox=FALSE, but for 6 traits, only fastapprox=TRUE is available.
The prior probability for the joint models includes a term that gives more weight to joint models that have a shared causal variant between the traits; this term κ is derived in a combinatorial manner and is identical to that used in MFM [1]. As in MFM, a correction term τ is also included to ensure that the prior probability of a certain number of SNPs in a model is the same for any value of κ. When κ = 1, there is no weight for joint models with shared causal variants and the flashfm PP for each model for a given trait is the same to what one would obtain from single-trait fine-mapping, which we also refer to as independent fine-mapping, as it does not make use of data from other traits.

Implementation
There are two options for implementing flashfm. If single-trait fine-mapping results have not already been obtained, they may be generated within flashfm using an extended version of JAM (Joint Analysis JAM assesses the joint effect of multiple SNPs on a trait in an integrated Bayesian penalized regression framework, outputting the posterior probabilities (PP) for the multi-SNP models. This allows us to identify the models with non-negligible evidence that should be the focus when assessing joint models between multiple traits. As JAM operates on a set of tag SNPs due to colinearity issues, we have extended it such that all models are expanded by their tag SNPs in the same manner as GUESSFM [5] (https://github.com/chr1swallace/GUESSFM). This is done by substituting each tag SNP in a model by each of the SNPs that it tags so that if SNPs 1 and 2 are in a model and they each tag t 1 and t 2 SNPS, respectively, the model expands into (t 1 + 1)(t 2 + 1) models, for which ABFs are found using (4;β for multi-SNP models are obtained from the single-SNPβ and the genotype matrix (or reference panel) of the SNPs in the model.
Using a binomial prior distribution we may then find PPs for all of the expanded (and original) models.
For ease of interpretation, we also construct SNP groups (using the snp.picker function of GUESSFM) such that SNPs in the same group could be substituted for one another; SNPs in the same group are in high LD and are rarely selected together in models. The results are then summarised in terms of SNP group PPs by summing over SNP models that fall into each SNP group model; the PP for the SNP group model A + B is the sum over PPs from all models with one SNP from A and one SNP from B.
The posterior probability of model γ 1 for trait 1 is proportional to a sum of the posterior probabilities of all configurations C 1, j , j = 1, . . . , n. Let I i, j be an indicator function, taking the value 1 if γ i ∩ γ j = ∅ and 0 otherwise, and let δ i j = exp D i j Then Rather than considering all model combinations, we reduce the model space by setting a cumulative posterior probability threshold (e.g. cpp=0.99). For each trait, we use the single-trait fine-mapping results to order the models by PP and retain those for which the sum of their PPs is below 0.99. As these δ i j terms depend on the SNPs that are included in each model, a loop over the model combinations is required to make these small calculations.

Related Individuals Implementation
The above derivations are based on a sample of unrelated individuals. If the proportion of related individuals is relatively large such that their removal would be a noticeable loss in data, rather than excluding related samples, an alternative approach is considered. First, single-SNP mixed linear models that account for relatedness are fit for each trait using GEMMA [7] (Genome-wide Efficient Mixed Model Association). The output from GEMMA includes the relatedness-adjusted effect estimatesβ of each SNP for one trait, which may then be used as input to JAM [4] or FINEMAP [2], as above, to identify the models with non-negligible evidence. As the single-SNP effect estimates are adjusted for relatedness, they may be used together with the genotype matrix of unrelated samples (or reference panel) as above to obtainβ for multi-SNP models, which are needed to get log(ABF) as in (4); as these effect estimates are adjusted for relatedness, they may be treated as if obtained from an unrelated sample. The effective sample size is used as N in the log(ABF) calculation.

Supplementary Section 2: Region construction for fine-mapping in the Ugandan cohort
In order to obtain more precision in the construction of the fine mapping regions we consider the centimorgan (cM) genetic distance between SNPs. Approximately 80% of the SNPs (hg19/build 37) in the Ugandan data set do not map to a cM (reference panel) position so missing values were imputed using linear interpolation.
We then considered the GWAS for each of the 33 traits and selected the SNPs using a p-value threshold of 1 × 10 −6 .
Next, we sorted the p-values of the selected SNPs from all trait GWAS in descending order and removed any duplicated SNPs. Finally, regions were constructed using the following steps: 3. Repeat the procedure for subsequent SNPs, checking first if the SNP belongs to any previously constructed region.
Applying this procedure to the Ugandan data set we obtained 56 regions detailed in Supplementary