Fitting Proportional Odds Model to Case-Control data with Incorporating Hardy-Weinberg Equilibrium

Genetic association studies have been proved to be an efficient tool to reveal the aetiology of many human complex diseases and traits. When the phenotype is binary, the logistic regression model is commonly employed to evaluate the association strength of the genetic variants predispose to human diseases because the maximum likelihood estimator of the odds ratio based on case-control data is equivalent to that from the same model by taking the data as being arisen prospectively. This equivalence does not hold for the proportional odds model and using it to analyze the case-control data directly often results in a substantial bias. Through putting a parameter of the minor allele frequency in the modified likelihood function under the condition that the Hardy-Weinberg equilibrium law holds within controls, a consistent estimator is obtained. On the basis of it, we construct a score test statistic to test whether the genetic variant is associated with the diseases. Simulation studies show that the proposed estimator has smaller mean squared error than the existing methods when the genetic effect size is away from zero and the proposed test statistic has a good control of type I error rate and is more powerful than the existing procedures. Application to 45 single nucleotide polymorphisms located in the region of TRAF1-C5 genes for the association with four-level anticyclic citrullinated protein antibody from Genetic Analysis Workshop 16 further demonstrates its performance.

carcinoid heart disease (CHD): without CHD, mild CHD and severe CHD 7 , and four levels for those of live steatosis: normal liver, light steatosis, moderate steatosis, and severe steatosis 8 .
Several procedures were proposed to analyze the retrospective data with ordinal responses in the literatures. An ad hoc approach is to use the proportional odds model 9 by taking the retrospective data as being enrolled prospectively. However, it is not appropriate because the proportional odds model does not belong to the multiplicative intercept risk model 10,11 and the resulting maximum likelihood estimator (MLE) of the interested parameter is not consistent to its true value except for the scenario that the true value of the concerned parameter is 0. So, under a discrete choice probability model, Cosslett 10 proposed to maximize a modified likelihood function to get the MLE; Wild 11 considered fitting the proportional  odds model to case-control data from a finite population with known population totals in each response category and obtained the MLE. Based on the final optimization function, it revealed that Wild's MLE is identical to that of Cosslett. The Hardy-Weinberg equilibrium (HWE) law is a very important principal in population genetics. It is a routine to check whether the observed genotypes satisfy the HWE law in control population before conducting an association test, because deviations from HWE can indicate many problems such as population stratification, genotyping error and so on [12][13][14] . In a genome-wide association study, the threshold of p-value is 10 −4 for the HWE test to ensure that there is no possible systematic genotyping error in the sampled individuals. On the other hand, checking whether the HWE law holds in case population has  been used as an association test for fine-mapping of the disease loci 15,16 . In a further way, the HWE law has also been advocated in many associated studies. For example, Wang and Shete 17 derived a powerful test by incorporating the derivations of HWE in cases for single-marker analysis; Zheng and NG 18 proposed a powerful two-phase analysis by using the HWE test to classify the genetic models; Chen et al. 19 considered testing the gene-environment interaction by assuming that the HWE holds in the controls. Consider a biallelic SNP locus with two alleles A and a. Denote the allele frequency of A by p. Under the HWE principal, the genotype frequencies of AA, Aa and aa are p 2 , 2p(1 − p) and (1 − p) 2 , respectively. To the best of our knowledge, most of the exiting methods in the literatures focused on the estimation of the parameters. Although the test statistic such as the score test or the Wald test derived from the proportional odds model is still valid and has been used in practice, such as the CHD study 7 and  the liver study 8 , we will show that it might lose power under the alternative, especially when the genetic effect size is large. In this work, by incorporating HWE principal in control population, we obtain a new estimator, which optimizes the newly modified likelihood function. Using this, we derive the score test statistic, which is shown to be more powerful than the exiting methods through extensive computer simulations. Finally, we apply it to 45 SNPs in the region of TRAF1-C5 for the association with four-level anticyclic citrullinated protein antibody from Genetic Analysis Workshop 16 and find that there are three SNPs significantly associated with anticyclic citrullinated protein antibody measure at the genome-wide significance level of 10 −7 .

Results
Simulation Settings. We compare the performances of three estimators: proMLE (the MLE derived from the likelihood function by taking the data as being arisen prospectively), modMLE (the MLE derived from the modified likelihood function) and hweMLE (the proposed method). What needs illustration is that the parameters used in this section are defined in the following "Notation" section.  Table  S1-S4 in the supplemental material summarize the results of the empirical bias and square root of mean squared error (srMSE). These tables indicate that the empirical bias of hweMLE is the smallest among the three estimators and the srMSE of hweMLE is the smallest under most of the considered scenarios, especially when β is relatively large.
Type I error rate. As is shown in the "Methods" section, the observed Fisher information matrix of the modMLE is close to singular since there is an equality among Δ j s, j = 2, 3,…, J based on the reciprocal of case-control design. So we compare two test statistics. One is the score test derived from the proportional odds model by taking the data as being arisen prospectively. For convenience, we denote it by proT; Another is the proposed hweT. Table 1 shows the empirical type I error rates for the MAF ranging from 0.1 to 0.5 and the nominal significance levels of 0.05 and 0.001. We set n = 1,000. 1,000 and 50,000 replicates are conducted to calculate the empirical type I error rates. The results indicate that both proT and hweT can control the type I error rates correctly with the empirical values being close to the nominal level. For example, when the MAF is 0.20 and the nominal level is 0.05, the empirical type I error rates of proT and hweT are 0.052 and 0.044, respectively, and when the MAF is 0.15 and the nominal level is 0.001, those of proT and hweT are, respectively, 0.00086 and 0.00078.
Power Comparison. In this part, we explore the power performances of proT and hweT. For the convenience, we assume = ∑ = n n j j 1 1 4 . In order to make the power comparable, we set the small sample size for large β. In details, we set n = 1,000, 500, 300, and 200 for β = ln 1.2, ln 1.4, ln 1.6 and ln 1.8, respectively, under the nominal significance level of 0.05, and n = 2,400, 1,000, 600, and 400 for β = ln 1.2, ln 1.4, ln 1.6 and ln 1.8, respectively, under the nominal significance level of 0.001. We conduct 1,000 and 50,000 replicates for the significance level of 0.05 and 0.001. Figures 5 and 6 show the power results. Both figures indicate that the proposed hweT is more powerful than the proT. In some cases, there is 6% power increase. For example, when n = 1,000, MAF = 0.35, and β = ln 1.4, the power of hweT is 0.582, which is much larger than that 0.522 of proT under the significance level of 0.001.
Application to Four-level Anticyclic Citrullinated Protein Antibody data. The region of TRAF1-C5 in human genome has been shown to be associated with rheumatoid arthritis (RA) based on both genome-wide association study 20,21 and candidate gene approach 22 . The anticyclic citrullinated protein (anti-CCP) antibodies have been frequently found in the blood of the individuals with RA 23 . It is reasonable to assume that there are associations between the anti-CCP measure and the SNPs in the region of TRAF1-C5. To test this hypothesis, we apply the hweT and the proposed hweMLE procedures to the data from the Genetic Analysis Workshop 16 (GAW16) 24 . This data consists of 2,062 subjects. Based on the anti-CCP measure, the subjects can be divided into four categories: without RA, below 20; low or weak, 20-39; moderate, 40-59; high or strong, > 60. The number of subjects are 1,195, 103, 66, and 698 corresponding to the above four categories, respectively. There are 45 SNPs in the region of TRAF1-C5 on Chromosome 9. The snpids (SNP ID), positions, the point estimators, and the p-values of the existing and the proposed procedures are summarized in Table 2. Before conducting the association analysis, we use the HWD coefficient 25 , denoted as D, to test whether the HWE law holds in the controls. , n c0 , n c1 and n c2 are the counts of the subjects possessing the genotype 0, 1 and 2, respectively, and n c = n c0 + n c1 + n c2 . Under that D = 0, the HWE test follows the standard normal distribution. The results in Table 2 show that the HWE law holds in the controls for these 45 SNPs under the significance level of 0.05. Then we apply the proposed hweT and proT to test for the associations between these 45 SNPs and the anti-CCP measure. We find that the significance of the association between these SNPs and the anti-CCP measure using the proposed hweT is always stronger than those using the proT. For example, we can identify three SNPs, rs1953126, rs881375 and rs3761847 with p-values less than 10 −7 using the proT or the hweT. However, we can identify another five SNPs including rs10760130, rs10985073, rs2900180, rs7037673, and rs1468673 with p-values being less than 0.0001 using the hweT, while only three SNPs rs1953126, rs881375 and rs2900180 can be identified using the proT. In addition, we use the Fisher-combined method to combine the p-values over these 45 SNPs as . The combined values of T com for the proT and hweT are 408.9 and 512.2. Based on 1,000 bootstrap replicates, we calculate the p-values of T com for the proT and hweT. Both are less than 0.001. This indicates that the gene TRAF1-C5 is associated with the anti-CCP measure and that the hweT can detect association signals easily than the proT.

Discussion
When using the logistic regression to handle the binary response outcome in genetic association studies, it has been shown that the odds ratio estimate based on the MLE is equivalent to that from the same model by taking the data as being arisen prospectively 4-6 . However, this equivalence does not hold for the proportional odds model. Cosslett 10 and Wild 11 proposed to obtain a consistent estimator through optimizing a modified likelihood function. In this work, by incorporating HWE principal in the retrospective likelihood function, we extend Cosslett's procedure and obtain a consistent and asymptotically unbiased estimator. Based on this estimator, we construct the score test statistic. Numerical results show that the MLE from the prospective proportional odds model is substantially biased and the proposed estimator is consistent and the proposed score test statistic is powerful than that constructed from a prospective likelihood function.
HWE principal is very important in genetic association studies. It is often considered to be a cornerstone for further statistical inference. Departure from HWE often result from inbreeding, population migration and genotyping errors. Researchers have suggested that the deviation of HWE among cases can provide additional evidence for the associations between genetic variants and human diseases 17,19 .
Scientific RepoRts | 5:17286 | DOI: 10.1038/srep17286 As shown in the results, incorporating HWE into the proportional odds model can also improve the efficiency of the estimate of genetic effect and also improve the power to identify the deleterious genetic variants. We also explore the performance of the proposed procedure when the HWE is violated. The simulation results are available in the supplementary material, which indicate that the proposed procedures work well when the HWE is violated slightly. Actually, if the HWE law is violated, we can estimate the parameters of interest through assuming that the genotype frequencies in the control group satisfy Pr(G = 0|Y = 1) = p 0 , Pr(G = 1|Y = 1) = p 1 , Pr(G = 2|Y = 1) = p 2 . Thus there is one additional parameter that needs to be estimated in the proposed modified likelihood function. At this point, the number of parameters is larger than that under the assumption of HWE. Hence, the biases of the estimators tend to bigger than those under the assumption that the HWE law holds.
The sandwich variance estimate is a common tool used to estimate the variance of quasi-likelihood estimates from generalized estimating equations (GEE) 26 . However, Kauermann and Carroll 27 proved that the sandwich variance estimator has the downward bias with O(n −1 ) order for the quasi-likelihood estimates from GEE, where n is the total sample size, because it is derived based on the first-order approximation of the Taylor expansion about the estimating equation. Thus in our case, if we use the sandwich variance estimator to construct the test statistic, this may result in inflated type I error rate. Hence, we use the summation of the first derivatives of the likelihood function on the individual observation to estimate the variance of the MLE. It should be noted that the used variance estimate is a consistent estimate based on the law of large numbers.

Methods
Notations. Consider a biallelic SNP and the genotype at a marker locus is coded as 0, 1 or 2, with the value corresponding to the copy number of a certain candidate allele. Let Y be a J ordered status response variable and G be a random variable taking the genotype values of the subjects at a SNP locus. Without loss of generality, let Y = 1 denote the status of a healthy individual, and Y = j denote the status of a diseased subject, j = 2, 3,…, J. Then the standard proportional odds model 9 is where β is the parameter of interest, which is called log-odds ratio when J = 2, and θ j , j = 1, 2,…, J − 1 are the intercepts. Denote φ(x) = 1/(1 + exp(− x)) for x ∈  . Using (1), we have

Consistent Estimate.
If we take the data as being collected from a prospective study, the prospective likelihood function is As shown in Cosslett 10 , using the above model to analyze the retrospective data directly often leads to a biased estimate of β when β ≠ 0 and the bias increases as β increases. So, Cosslett 10 proposed to optimize the following modified log-likelihood function to get the estimate of β:  where Δ = (Δ 2 , Δ 3 ,…, Δ J ) τ and Δ 1 = n 1 /n. Based on the reciprocal of case-control design where all case groups are randomly sampled from the case population, the structure among different case groups in the sample is the same as that in the general case population. So each case group should have the same degree of importance which yields Δ 2 = Δ 3 = … = Δ J . Taking this equality into consideration, the score test statistic cannot be constructed using l m (β, θ, Δ ) because the observed Fisher information matrix of (β, θ τ , Δ τ ) τ is close to be singular. So, in the following part, we will derive a MLE through incorporating this equality and the HWE law. Suppose that the HWE principle holds in the control population with the minor allele frequency p. Thus Pr(G = 0|Y = 1) = (1 − p) 2   where σ β θ ( , ) ββ 2 is defined as above. Under the null hypothesis, hweT asymptotically follows the standard normal distribution.