Model-based clustering for identifying disease-associated SNPs in case-control genome-wide association studies

Genome-wide association studies (GWASs) aim to detect genetic risk factors for complex human diseases by identifying disease-associated single-nucleotide polymorphisms (SNPs). The traditional SNP-wise approach along with multiple testing adjustment is over-conservative and lack of power in many GWASs. In this article, we proposed a model-based clustering method that transforms the challenging high-dimension-small-sample-size problem to low-dimension-large-sample-size problem and borrows information across SNPs by grouping SNPs into three clusters. We pre-specify the patterns of clusters by minor allele frequencies of SNPs between cases and controls, and enforce the patterns with prior distributions. In the simulation studies our proposed novel model outperforms traditional SNP-wise approach by showing better controls of false discovery rate (FDR) and higher sensitivity. We re-analyzed two real studies to identifying SNPs associated with severe bortezomib-induced peripheral neuropathy (BiPN) in patients with multiple myeloma (MM). The original analysis in the literature failed to identify SNPs after FDR adjustment. Our proposed method not only detected the reported SNPs after FDR adjustment but also discovered a novel BiPN-associated SNP rs4351714 that has been reported to be related to MM in another study.


Introduction
Genome-wide association studies (GWASs) aim to detect genetic risk factors for complex human diseases by identifying disease-associated single-nucleotide polymorphisms (SNPs).The most commonly-used approach in GWASs is the SNP-wise approach, in which a test of association is performed for each SNP, and then the Pvalues are adjusted for multiple testing.However, because of multiple testing adjustment to a huge number (> 1 million) of tests in GWAS, this approach often lacks power.Multiple testing adjustment uses no information other than P-values, which insufficiently models the relationships among SNPs, and need to be improved.
To reduce the number of tests, SNP-set analysis has been proposed (e.g., Wu et al. 2010 [1]; Dai et al. 2013 [2]; Lu et al. 2015 [3]; Cologne et al. 2018[4]).The idea is to use SNP sets to replace individual SNPs.Hence, the number of tests can be reduced.
However, it is challenging to define SNP sets.One approach is to define SNP sets based on existing biological knowledge.However, biological knowledge is subject to error.Poor quality of SNP-set can lead to low power (Fridley and Biernacka, 2011[5]).
Penalized regression approach has also been proposed in GWASs.For instance, linear mixed models (e.g., Kang et al. 2010 [6]; Lippert et al. 2011 [7]; Zhou and Stephens 2012 [8]) treat the effect of the SNP marker of interest as fixed, with the effects of all other SNP markers as normally distributed random effects.This process is repeated in turn for every SNP marker.However, it is a paradox to treat markers as fixed for inference but then otherwise as random to account for population structure for inference on association with other markers (Goddard et al. 2016[9]; Chen et al. 2017 [10]).
Bayesian hierarchical regression models treat the effects of all SNPs as random effects with either local priors or non-local priors (Mallick and Yi 2013 [11]; Fernando and Garrick 2013 [12]; Wang et al. 2016 [13]; Chen et al. 2017 [10]).Noticing that penalized regression methods often lead to large number of false positives and Bayesian regression methods are computationally very expensive, Sanyal et al. (2018) [14] proposed a non-local prior based iterative SNP selection tool for GWASs, which enables borrowing information across SNPs and can utilize the dependence structure across SNPs.However, all of these methods are for quantitative outcomes (e.g., height) in GWASs.
However, to the best of our knowledge, no methods have been proposed to borrow information across SNPs (categorical variables with three levels of genotype) to analyze case-control GWAS data that have binary phenotype (cases vs. controls).
In this article, we proposed a novel model-based clustering method for case-control GWASs (binary phenotype) by transforming the challenging high-dimension-smallsample-size problem to low-dimension-large-sample-size problem.Specifically, we treat SNPs as "samples" and subjects as "variables" and aim to cluster SNPs to three groups: (1) SNPs with minor allele frequencies (MAFs) higher in cases than in controls; (2) SNPs with MAFs lower in cases than in controls; and (3) SNPs with MAFs in cases same as in controls.For a given SNP, we assume its genotypes follow a multinomial distribution and the MAF follows a beta distribution.We also assume that the cluster proportions follow a Dirichlet distribution.In our method, we pre-specify the patterns of clusters by minor allele frequencies (MAFs) of SNPs between cases and controls, and enforce the patterns with the guide of prior distributions.The proposed model-based clustering method can improve the power of detecting disease-associated SNPs by borrowing information across SNPs.For details, please refer to the METHODS Section.
Our method is motivated by our investigation of the genetic risk factors of the bortezomib-induced peripheral neuropathy (BiPN) in treating Multiple Myeloma (MM) by using GWASs.MM is a type of cancer that causes a group of plasma cells (a type of white blood cell in the bone marrow that helps fight infections by making antibodies that recognize and attack germs) to be cancerous [24].The MM cancer cells produce abnormal proteins that can cause complications that can damage the bones, the immune system, kidneys, and red blood cell.MM is the third most common blood cancer in the United States.Bortezomib is a first-in-class proteasome inhibitor to treat MM [25,26].However, Bortezomib has some side effects, such as the development of a painful, sensory peripheral neuropathy (PN) [27][28][29].Bortezomib could induce neurotoxicity in neuronal cells by several mechanisms that lead to apoptosis [30].The symptoms of the BiPN include neuropathic pain and a length-dependent distal sensory neuropathy with a suppression of reflexes.Due to BiPN, patients often discontinue bortezomib treatment despite a good response to the therapy [31].If we could identify patients at the risk of developing BiPN, physicians then can choose alternative therapies, such as using weekly, reduced-dose, or subcutaneous approaches.
However, BiPN mechanisms are mostly unknown.It has been shown that a higher cumulative dose is likely to predict the increase of severity of BiPN [32,33].Pre-existing neuropathies, comorbidities (like diabetes mellitus) or myeloma-related peripheral nerve damage may also increase the risk of developing BiPN [34,35].Meregalli (2015) [36] provided a review of bortezomib-induced neurotoxicity.
GWAS could be used to unbiasedly identify genetic variants that will have a direct or indirect effect on drug sensitivity [29].NHGRI GWAS Catalog [40,41](https://www.ebi.ac.uk/gwas/) lists only two GWA studies that have been performed to identify SNPs associated with BiPN for MM patients.The first GWAS was performed by Magrangeas et al. (2016) [29], who identified one BiPN-associated SNP (rs2839629) based on 370,605 SNPs using a discovery set (469 samples) and a validation set (114 samples).However, results did not reach a genome-wide significance level.The second GWAS was conducted by Campo et al. (2018) [42], who identified four BiPN-associated SNPs (rs6552496, rs12521798, rs8060632, and rs17748074) based on 646 samples.Again, the results did not reach a genome-wide significance level.Moreover, each of the four lists of BiPN-associated SNPs listed above (the 20 SNPs identified by Broyl et al. (2010) [28]; the five SNPs identified by Favis et al. (2011) [39]; the one SNP identified by Magrangeas et al. (2016) [29]; and the four SNPs identified by Campo et al.
(2018) [42]) was not replicated in the other three studies.
Note that the existing two GWASs [28,42] used SNP-wise approaches (i.e., performing one association test per SNP), which over-adjusts for multiple tests due to insufficiently modeled relationships among SNPs (i.e., true FDRs/FWERs are smaller than nominal FDR/FWER levels), and hence are not powerful enough when sample sizes are relatively small.Therefore, the missing heritability [43] of BiPN could be due to less powerful statistical methods.Novel statistical methods are needed, they could borrow information across SNPs and better control FDR at nominal levels than the traditional over-conservative multiple testing adjustment approaches.
Our novel method for SNP discovery is a model-based clustering approach.In our method, information can be shared across SNPs by grouping SNPs into three clusters.
We pre-specify the patterns of clusters by minor allele frequencies (MAFs) of SNPs between cases and controls, and enforce the patterns with the guide of prior distributions.Using simulation studies and re-analysis of real data from Magrangeas et al. (2016) [29], we demonstrated that our method out-performs traditional approaches.In particular, compared to SNP-wise approaches, our method better controls FDR at a nominal level, and therefore it has a better sensitivity.

Results of simulation studies
We conducted simulation studies to compare the performance of our model-based clustering method with the SNP-wise approach (e.g., logistic regression followed by multiple testing adjustment ).For our method, data analysis of each simulated dataset uses two different ways to choose values of hyper-parameters of the prior distributions for MAFs (obtained from either truncated Beta(2,5) or empirical distribution via moment matching approach).Details on the design of simulation studies and the choice of hyper-parameters for MAF priors are explained in the 'METHODS' Section.
Figure 1 shows the simulation results for 500,000 SNPs with 200 effective SNPs using four different MAF distributions for data generation and two different sample sizes.The results of comparing sensitivity are presented in the right panel of the figure, where the boxplots represent differences between our method and the SNP-wise approach, which were obtained by considering sensitivity of our method using truncated Beta(2, 5) (colored in light grey) or empirical distribution (colored in dark grey) minus sensitivity of the SNP-wise approach for each simulated dataset.Our method outperformed the SNPwise approach in sensitivity, since all of the boxes are on the right-hand side of the vertical dashed line 0. Boxplots in the middle panel show FDRs for the SNP-wise approach, our method using truncated Beta(2, 5) for analysis, and our method using empirical estimates for analysis.FDR of detected SNPs using our method is much closer to the nominal level 0.05 (vertical dashed line) than the SNP-wise approach for every setting presented in the figure.We conducted Wilcoxon signed rank tests to compare our method and the SNP-wise approach on |FDR-0.05|and on sensitivity for the settings in Figure 1.All of tests were significant with small P-values (< 0.0093; see Supplementary Table S2), confirming that the differences observed from the parallel boxplots are statistically significant.By comparing the top four rows with the bottom four rows in Figure 1, we also observed that the improvement of our method over the traditional approach becomes more prominent when the sample size of the study was smaller.
In all other settings of simulations studies, compared with the traditional approach, our method consistently showed higher sensitivities and FDRs closer to the nominal level of 0.05 (see Figures S1-S5 in Supplementary File S4).Results from Wilcoxon signed rank tests for these settings are also provided in Supplementary Table S2 with all P-values smaller than the significance level 0.05.Even when we incorrectly specified the prior distributions for analysis (e.g., when we used different values between data generation and data analysis for hyper-parameters  and ), our method still outperformed the traditional SNP-wise approach (see rows 2-4 and 6-8 in Figure 1 and S1-S5 in Supplementary File S4).
In summary, our method performed better in discovering effective SNPs associated with the outcome.The traditional SNP-wise approach over-penalizes multiple testing and insufficiently utilizes the shared information among SNPs.When targeting on the nominal FDR level 0.05, the traditional approach always had a true FDR less than the nominal level, which reduces sensitivity.

Results of real data analysis
From the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/),we downloaded two SNP datasets (GSE65777 and GSE66903) that were used by A genome-wide association study of the 370,605 SNPs after quality control (QC) on the discovery data GSE65777 was conducted by Magrangeas et al. (2016) [29].In our analysis, 247,372 SNPs that passed our QC criteria, which is basically the same as the criteria used by Magrangeas et al. (2016) [29] except for two minor changes (see Section D of Supplementary File S3 for details on our QC steps).
We first re-analyzed discovery data (GSE65777) with SNP-wise approaches.The association between the outcome and each SNP was tested by both logistic regression and the Cochran-Armitage test.No SNP is significant after multiple testing adjustment at FDR level 0.1.Note SNPs reported by Magrangeas et al. (2016) [29] were not detected with the genome-wide threshold 5x10 -8 , but with a much larger P-value threshold 10 -5 .
We then analyzed the discovery dataset (GSE65777) with our method.The values for hyper-parameters α and β were set to their moment estimates from SNP data.Table 1 lists the significant SNPs detected based on the combinations of settings of two different pseudo counts and two targeted FDR levels.Since all significant SNPs detected by our method have been adjusted for FDR, our method is more powerful in detecting SNPs than the traditional approach.
We next analyzed the validation dataset (GSE66903) to validate the significant SNPs in Table 1, using exactly the same approach as Magrangeas et al. ( 2016) [29].The SNP rs2839629 can be validated (with a P-value of 0.0324 after multiple adjustment controlling FDR level at 0.1), which is detected in discovery dataset using a pseudo count of (3, 3, 3) and a detection rule of FDR 7 < 0.1 (Table 2).
We tried analyzing the data with a different pseudo count (5,5,5), and get exactly the same results as using (3,3,3).When we used a stronger prior by increasing the pseudo count to (20,20,20), we obtained six SNPs that were assigned to the clusters of significant SNPs.Among the six SNPs, rs4351714 is a possible novel SNP to BiPN, which locates in the intron region of gene KDM5B and is proved to be associated with multiple myeloma [44,45], but no existing literature has reported that rs4351714 is associated to BiPN.KDM5B is known as a member of the KDM5 subfamily that serves as transcriptional co-repressors, specifically catalyzing the removal of all possible methylation states from lysine 4 of histone H3 (H3K4me3/me2/me1).It has been linked to control of cell proliferation, cell differentiation and several cancer types.By employing a KDM5 enzymes inhibitor in myeloma cells, a higher quartile of KDM5B expression was found to be associated with shorter overall survival in myeloma patients [45].

Discussion
We proposed a novel model-based clustering method to characterize the association between SNPs and a binary outcome in case-control genome-wide association studies.
Compared with the traditional SNP-wise approach, it has advantages in efficiently utilizing the data, since we account for the relationships among SNPs in the model.Our novel method has two major advantageous features.
First, compared to the traditional method, our method provides more power to detect true SNPs associated with the outcome and better controls FDR at a nominal level without an over-conservative penalty from multiple testing adjustment.In the traditional SNP-wise approach, an association between the outcome and each SNP is tested separately, and then their P-values are adjusted for multiple testing.The multiple testing adjustment is purely based on P-values, which insufficiently utilizes the relationship among SNPs.In contrast, we group SNPs into clusters according to the pattern of their MAFs, allowing SNPs with similar patterns to share information with each other.The advantage of this feature of our method is demonstrated in both simulation studies and the re-analysis of the real data from a study on patients with multiple myeloma treated with bortezomib.
Second, our model-based clustering method can handle millions of SNPs, which makes it tractable to "simultaneously" model a huge number of SNPs from the ultra-high dimensional GWAS data.Though the model is complex and involves millions of parameters, making the algorithm of model fitting quite challenging to implement, we integrate out the nuisance parameters (i.e., remove nuisance parameters from model likelihood by averaging likelihood over the distribution of nuisance parameters).By only dealing with the essential parameters in the model fitting process, we make the algorithm feasible without losing information.
Note that our novel model-based clustering method is different from the standard clustering methods.Standard clustering methods are an unsupervised learning method that discovers patterns freely from data.In contrast, supervised learning methods train models with both outcomes and predictors.Our method is in between.We specify cluster structures and enforce characteristics of each cluster by model priors, but we do not have true cluster memberships as the outcome to train the model.So we call our method a pseudo-supervised learning approach.
In our pseudo-supervised learning approach, the prior modeling is the key component, which regulates SNPs into the correct clusters.In the machine learning literature, regularized regression models (e.g., Lasso and Elastic Net) always have their Bayesian equivalent counterpart (Bayesian Lasso [46] and Bayesian Elastic Net [47]).In these Bayesian approaches, shrinkage priors are used to achieve the equivalent penalty effect in regularized regressions.We adopt this idea and let the prior distributions guide the discovery of the patterns.In our model, the number of clusters is fixed, and the pattern of each cluster is described and enforced by the prior distributions.
Using the same validation approach as in Magrangeas et al. (2016) [29], we validated the same SNP identified by Magrangeas et al. (2016) [29].No more SNPs were validated partly due to the inconsistency of signals between the discovery dataset and the validation dataset.First, not many SNPs have strong signals in both studies.See Table (a) in Supplementary File S5 where we ranked SNPs by raw P-value from smallest to largest.Among the top 1000 ranked SNPs in the discovery dataset with smallest raw P-values, only 30 SNPs have a raw P-value <0.05 for the validation data (see Table (b) in Supplementary File S5).Also, the ranks of these 30 SNPs are quite low in the validation set.Second, many SNPs have signals from opposite directions between the discovery set and the validation set.Among the top 1000 SNPs, 513 of them have opposite sign of MAF difference between cases and controls in the two datasets (Table (a) in Supplementary File S5).This means more than half of the topranked SNPs are risk factors in one study but protective factors in another study.
Among the 30 SNPs with reasonably strong signals in both studies, as mentioned above, nine SNPs showed opposite directions of MAF difference between cases and controls.

Conclusion
Genome-wide association studies (GWASs) aim to detect genetic risk factors for complex human diseases by identifying disease-associated single-nucleotide polymorphisms (SNPs).We developed a novel method for SNP discovery based on model-based clustering, which can also be considered as a pseudo-supervised machine learning approach.We compared our method with the traditional SNP-wise approach through simulation studies and a real data analysis.The traditional SNP-wise approach is over-conservative since its adjustment for multiple testing is purely based on Pvalues, insufficiently accounting for the relationship between SNPs.Therefore, its true FDR is always less than the nominal level and has less power to detect true signals.In comparison, our method can better control FDR at nominal level and detect more effective SNPs.In addition, our method simultaneously models all SNPs but makes computing feasible by integrating out nuisance parameters from the model.
In the re-analysis of the real data from Magrangeas et al. (2016) [29], the traditional method failed to detect any significant SNP after FDR adjustment.In contrast, our proposed method not only detected effective SNPs at the genome-wide significance level, which were reported in Magrangeas et al. (2016) [29] with a much larger P-value threshold than the genome-wide significance level, but also identified a novel BiPNassociated SNP rs4351714 that has been proven to be associated with multiple myeloma.
In summary, our method outperforms the traditional SNP-wise approach in SNP discovering from case-control GWAS.

Notations and 3-cluster mixture models
Suppose we measure genotypes of G SNPs for n : MM patients with BiPN (cases) and n < MM patients without BiPN (controls).Our goal is to identify a subset of SNPs that are significantly associated with the risk of developing BiPN.For each SNP, we code its genotype as: 0 minor allele (wild-type homozygote), 1 minor allele (heterozygote), and 2 minor alleles (mutation homozygote).We assumed Hardy Weinberg Equilibrium (HWE) for each SNP.Then the genotype frequencies of a SNP can be expressed as functions of the Minor Allele Frequency (MAF) θ: Pr(genotype=0) = (1 − θ) ?, Pr(genotype=1) = 2θ(1 − θ), and Pr(genotype=2) =θ ? .If a SNP has significantly different MAFs between cases and controls, then this SNP is associated with the risk of developing BiPN.
We assume that there are three clusters of SNPs: (1) no effect cluster: cluster of SNPs having similar MAF between cases and controls (denoted as cluster 0); (2) positive effect cluster: cluster of SNPs having significantly higher MAF in cases than in controls (denoted as cluster +); and (3) negative effect cluster: cluster of SNPs having significantly lower MAF in cases than in controls (denoted as cluster −).That is, a MM patient with minor alleles of any SNP in cluster + tends to develop BiPN after receiving Bortezomib (i.e., having positive tendency in developing BiPN), while a MM patient with minor alleles of any SNP in clustertends to be protective from developing BiPN (i.e., having negative tendency in developing BiPN).SNPs in cluster 0 do not affect developing BiPN.
We use vectors of binary latent variables to describe the unknown cluster memberships of SNPs.For a given SNP g, its cluster membership can be expressed as   = Fz H,I , z H,J , z H,K L. Values of  can be interpreted as pseudo count [48,49] for each cluster.
The primary objective of the data analysis is to estimate the posterior distribution of z H given all observed SNP data and values of all model parameters, which will be used for the inference about which SNPs are significantly associated with the outcome.What we described above is mixture models with three components.Next, we discuss in details about how the mixture models can be derived from mixture of Bayesian hierarchical models by modeling different patterns of the relationship between SNPs and binary outcomes of case-control status inside each cluster.

Model and prior of genotype and MAFs
For a given SNP g of patient i under condition d, we denote its genotype and minor allele frequency (MAF) in cluster k as S H,f,g and θ H,f,M respectively, where d = x (case) or y (control), g = 1, …, G, i = 1, …, n f , and k=0, +, −.We modeled the distribution of the genotype S H,f,g of SNP g by the following multinomial distribution: Note for SNPs in cluster 0, we have θ H,:,I = θ H,<,I ; For SNPs in cluster +, we have θ H,:,J > θ H,<,J ; and for SNPs in cluster −, we have θ H,:,K < θ H,<,K .These conditions will be enforced by prior distributions of MAF discussed below.For a SNP in cluster +, i.e., SNPs having larger MAF in cases than in controls, we define a "half-bell shape" prior which enforces patterns of cluster + with probability 1.
The indicator function I(θ H,:,J > θ H,<,J ) enforces the constraint θ H,:,J > θ H,<,J with probability 1.It "flattens" half of the bivariate distribution to change the shape into "halfbell", which makes a PDF of θ H,:,J and θ H,<,J satisfy our constraints with probability 1.
In GWASs, the set of parameters {θ H,f,M } contains huge numbers of elements due to a large number of SNPs to be considered.Note that all {θ H,f,M } are nuisance parameters, since they are not used for final inference of association between SNPs and outcomes.
Hence, we integrate out all nuisance parameters {θ H,f,M } from the model by averaging the complete likelihood over their prior distributions (more details in Section A of

Supplementary File S3). Marginalization reduces a huge number of unnecessary
To estimate the responsibilities γ H,M , we first apply the EM algorithm to estimate the model parameters, mixture weights π M , and then plug the estimated model parameters into Formula (2).See Section B of Supplementary File S3 for details on the implementation.
A straightforward decision rule about cluster membership is to assign each SNP to the cluster with the highest posterior probability, i.e., assign SNP g to the cluster corresponding to the largest value of γ H,I , γ H,J , and γ H,K .
An alternative is to assign a SNP to effective clusters (+ or −) if its responsibility of coming from cluster + or − is greater than a threshold τ.Following Yuan and Kendziorski (2006) [52], the value of τ can be specified to achieve a desired level of false discovery rate (FDR) given as follows [53]: where "card{set}" means the number of elements in "set".
The second approach is preferred in most applications, since controlling FDR of detected SNPs is often desired for genomic studies.Users can decide which decision rule to use based on their specific applications.

Design of simulation studies
error (false positive rate).FDR is much more popularly used in genomic studies.Hence, we decided to control FDR instead of specificity.
(approximated beta distributions via the moment matching: dashed lines are Betaapproximations of truncated Beta(2, 5) and dotted lines are Beta-approximations of empirical distributions estimated from data).The middle panel shows boxplots for FDR and the vertical dashed line represents the nominal level (0.05).The right panel shows boxplots for paired difference of sensitivity between our method (using truncated Beta (2,5) or empirical distribution for analysis) and the SNP-wise approach, and the vertical dashed line represents 0 (i.e., same performance between our method and the SNP-wise approach).White boxes represent the SNP-wise approach, light grey boxes represent our method using truncated Beta(2, 5) for analysis, and dark grey boxes represent our method using empirical distributions for analysis.Plain solid circles indicate model parameters to be estimated, while dashed circles represent nuisance parameters to be integrated out from the model likelihood by marginalization.Gray-filled circles represent pre-specified hyper-parameters.
Magrangeas et al. (2016)[29] to evaluate the genetic effects on developing severe bortezomib-induced peripheral neuropathy (BiPN).The discovery dataset (GSE65777) contains 909,622 SNPs and 469 newly-diagnosed patients with multiple myeloma treated with bortezomib.The goal was to compare SNP genotypes from 155 patients with grade ≥ 2 BiPN with those from 314 MM patients with grade 1 BiPN or no BiPN.The validation dataset (GSE66903) contains 795,734 SNPs and 116 MM patients treated with bortezomib.The goal of the validation study was to compare 41 bortezomib-treated grade ≥ 2 BiPN patients with 75 bortezomib-treated control patients.
Let z H,M = 1 indicate that SNP g belongs to cluster k, and z H,M = 0 otherwise, where k = 0, +, − is the index of clusters.Here the indicators must satisfy ∑ z H,M = 1 M∈{I,J,K} .Let  = (π I , π J , π K ), satisfying ∑ π M = 1 M∈{I,J,K} , be the proportion of SNPs in three clusters, which is called a mixture proportion.So the cluster membership z H follows a multinomial distribution as PrFz H WL = Multinomial(1, ) = ∏ π M _ `,a M∈{I,J,K} .We assume a Dirichlet prior Dir() on ,  = (b I , b J , b K ).

Figure 2 .
Figure 2. Directed acyclic graph representation of our model-based clustering MAFs in different clusters can be distinguished and enforced by different prior distributions.Shared prior distributions of MAFs allow them to borrow strength from each other.If a SNP has no effect on the outcome, it should have the same MAF in cases and controls.Hence, we use the same conjugate prior for both cases and controls: θ H,f,I ~ Beta(α, β).We denote its Probability Density Function(PDF) as h(•), and use this PDF to help construct PDFs of the prior distributions for the other two clusters.