Introduction

A retrospective study is highly popular in genetic epidemiology study because of its economic cost and substantially reduced study duration compared with a prospective design. The data in a retrospective design are not drawn from the general population and they are randomly sampled from each subpopulation and the numbers of subjects chosen from each individual subpopulation are usually matched. In the last decade, the retrospective case-control genetic association studies, especially genome-wide association studies, have been considered as a big success in searching for the deleterious genetic susceptibilities1,2,3. By now, more than ten thousand single nucleotide polymorphisms (SNPs) have been identified to be associated with human complex diseases (http://www.genome.gov/gwasstudies). There are two types of phenotypes: continuous and discrete. The majority of the discrete phenotypes are binary and ordinal. The logistic regression model is a major tool to analyze the binary phenotypes because the odds ratio estimator from the logistic regression model based on case-control data is equivalent to that from the same model by taking the data as being sampled from a prospective study4,5,6. Although there is a lack of identification of the intercept, it does not matter because the intercept is not concerned in practice. Compared with that using two statuses (case and control) to define the medical outcomes, an ordinal description with three or three more values might be more accurate to measure the quality of life for some human complex diseases. For example, there are three levels for depicting the degree of severity of carcinoid heart disease (CHD): without CHD, mild CHD and severe CHD7 and four levels for those of live steatosis: normal liver, light steatosis, moderate steatosis and severe steatosis8.

Several procedures were proposed to analyze the retrospective data with ordinal responses in the literatures. An ad hoc approach is to use the proportional odds model9 by taking the retrospective data as being enrolled prospectively. However, it is not appropriate because the proportional odds model does not belong to the multiplicative intercept risk model10,11 and the resulting maximum likelihood estimator (MLE) of the interested parameter is not consistent to its true value except for the scenario that the true value of the concerned parameter is 0. So, under a discrete choice probability model, Cosslett10 proposed to maximize a modified likelihood function to get the MLE; Wild11 considered fitting the proportional odds model to case-control data from a finite population with known population totals in each response category and obtained the MLE. Based on the final optimization function, it revealed that Wild’s MLE is identical to that of Cosslett.

The Hardy-Weinberg equilibrium (HWE) law is a very important principal in population genetics. It is a routine to check whether the observed genotypes satisfy the HWE law in control population before conducting an association test, because deviations from HWE can indicate many problems such as population stratification, genotyping error and so on12,13,14. In a genome-wide association study, the threshold of p-value is 10−4 for the HWE test to ensure that there is no possible systematic genotyping error in the sampled individuals. On the other hand, checking whether the HWE law holds in case population has been used as an association test for fine-mapping of the disease loci15,16. In a further way, the HWE law has also been advocated in many associated studies. For example, Wang and Shete17 derived a powerful test by incorporating the derivations of HWE in cases for single-marker analysis; Zheng and NG18 proposed a powerful two-phase analysis by using the HWE test to classify the genetic models; Chen et al.19 considered testing the gene-environment interaction by assuming that the HWE holds in the controls. Consider a biallelic SNP locus with two alleles A and a. Denote the allele frequency of A by p. Under the HWE principal, the genotype frequencies of AA, Aa and aa are p2, 2p(1 − p) and (1 − p)2, respectively.

To the best of our knowledge, most of the exiting methods in the literatures focused on the estimation of the parameters. Although the test statistic such as the score test or the Wald test derived from the proportional odds model is still valid and has been used in practice, such as the CHD study7 and the liver study8, we will show that it might lose power under the alternative, especially when the genetic effect size is large. In this work, by incorporating HWE principal in control population, we obtain a new estimator, which optimizes the newly modified likelihood function. Using this, we derive the score test statistic, which is shown to be more powerful than the exiting methods through extensive computer simulations. Finally, we apply it to 45 SNPs in the region of TRAF1-C5 for the association with four-level anticyclic citrullinated protein antibody from Genetic Analysis Workshop 16 and find that there are three SNPs significantly associated with anticyclic citrullinated protein antibody measure at the genome-wide significance level of 10−7.

Results

Simulation Settings

We compare the performances of three estimators: proMLE (the MLE derived from the likelihood function by taking the data as being arisen prospectively), modMLE (the MLE derived from the modified likelihood function) and hweMLE (the proposed method). What needs illustration is that the parameters used in this section are defined in the following “Notation” section. Since in the real application analyzed later, J = 4, we consider J = 4 in the simulations with Pr(Y = 1) = 0.98, Pr(Y = 2) = 0.01, Pr(Y = 3) = 0.006 and Pr(Y = 4) = 0.004, which results in θ1 = 3.89, θ2 = 4.59 and θ3 = 5.51 under β = 0. We choose β  {ln1.2, ln1.4, ln1.6, ln1.8} and the minor allele frequency (MAF) p {0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5}. Let n1 {1, 200, 500, 300, 200} and n2, n3 and n4 be drawn from a multinomial distribution Mul(n1, q), where n1 = n2 + n3 + n4 and q = (0.5, 0.3, 0.2)τ is the probability vector which is proportional to the corresponding prevalence rates of the case statuses with (P(Y = 2), P(Y = 3), P(Y = 4))τ. The reason why we choose several values of n1 is to make the power comparable for different genetic effect sizes. We consider two significance levels: 0.05 and 0.001. 1,000 and 50,000 replicates are conducted to calculate the empirical type I error rates and powers for the significance levels of 0.05 and 0.001, respectively.

Point Estimate

Figures 1, 2, 3, 4 show the boxplots of the above three estimators corresponding to β = ln 1.2, β = ln 1.4, β = ln 1.6 and β = ln 1.8, respectively, where we set the same value of n1 (= 500). 1,000 replicates are conducted. As expected, the proMLE is biased and the proposed hweMLE are unbiased. Interestingly, the proMLE underestimates β in most cases with the median values being smaller than the true values, while the modMLE overestimates β a little bit with the median values being greater than the true values. The absolute value of bias of the proMLE increases as β increases. For example, when MAF = 0.25, the bias of the proMLE for β = ln 1.2, ln 1.4, ln1.6 and ln 1.8 are −0.023, −0.051, −0.068 and −0.091, respectively. From these boxplots, when β is away from zero, the proposed hweMLE performs the best, followed by modMLE, then proMLE based on the bias of the median value. For instance, when β = ln1.2 (= 0.182) and MAF = 0.10, the median values of proMLE, modMLE and hweMLE are 0.163, 0.202 and 0.192, respectively, while for β = ln 1.4 (= 0.336) and the same MAF, the median values of proMLE, modMLE and hweMLE are 0.276, 0.340 and 0.338, respectively. Table S1–S4 in the supplemental material summarize the results of the empirical bias and square root of mean squared error (srMSE). These tables indicate that the empirical bias of hweMLE is the smallest among the three estimators and the srMSE of hweMLE is the smallest under most of the considered scenarios, especially when β is relatively large.

Figure 1
figure 1

The empirical point estimates of proMLE, modMLE and hweMLE for β = In 1.2.

Figure 2
figure 2

The empirical point estimates of proMLE, modMLE and hweMLE for β = In 1.4.

Figure 3
figure 3

The empirical point estimates of proMLE, modMLE and hweMLE for β = In 1.6.

Figure 4
figure 4

The empirical point estimates of proMLE, modMLE and hweMLE for β = In 1.8.

Type I error rate

As is shown in the “Methods” section, the observed Fisher information matrix of the modMLE is close to singular since there is an equality among Δjs, j = 2, 3,…, J based on the reciprocal of case-control design. So we compare two test statistics. One is the score test derived from the proportional odds model by taking the data as being arisen prospectively. For convenience, we denote it by proT; Another is the proposed hweT. Table 1 shows the empirical type I error rates for the MAF ranging from 0.1 to 0.5 and the nominal significance levels of 0.05 and 0.001. We set n = 1,000. 1,000 and 50,000 replicates are conducted to calculate the empirical type I error rates. The results indicate that both proT and hweT can control the type I error rates correctly with the empirical values being close to the nominal level. For example, when the MAF is 0.20 and the nominal level is 0.05, the empirical type I error rates of proT and hweT are 0.052 and 0.044, respectively and when the MAF is 0.15 and the nominal level is 0.001, those of proT and hweT are, respectively, 0.00086 and 0.00078.

Table 1 The empirical type I errors of proT and hweT.

Power Comparison

In this part, we explore the power performances of proT and hweT. For the convenience, we assume . In order to make the power comparable, we set the small sample size for large β. In details, we set n = 1,000, 500, 300 and 200 for β = ln 1.2, ln 1.4, ln 1.6 and ln 1.8, respectively, under the nominal significance level of 0.05 and n = 2,400, 1,000, 600 and 400 for β = ln 1.2, ln 1.4, ln 1.6 and ln 1.8, respectively, under the nominal significance level of 0.001. We conduct 1,000 and 50,000 replicates for the significance level of 0.05 and 0.001. Figures 5 and 6 show the power results. Both figures indicate that the proposed hweT is more powerful than the proT. In some cases, there is 6% power increase. For example, when n = 1,000, MAF = 0.35 and β = ln 1.4, the power of hweT is 0.582, which is much larger than that 0.522 of proT under the significance level of 0.001.

Figure 5
figure 5

The empirical powers of proT and hweT for β = In 1.2, In 1.4, In 1.6 and In 1.8 under the significant level α = 0.05.

Figure 6
figure 6

The empirical powers of proT and hweT for β = In 1.2, In 1.4, In 1.6 and In 1.8 under the significant level α = 0.001.

Application to Four-level Anticyclic Citrullinated Protein Antibody data

The region of TRAF1-C5 in human genome has been shown to be associated with rheumatoid arthritis (RA) based on both genome-wide association study20,21 and candidate gene approach22. The anticyclic citrullinated protein (anti-CCP) antibodies have been frequently found in the blood of the individuals with RA23. It is reasonable to assume that there are associations between the anti-CCP measure and the SNPs in the region of TRAF1-C5. To test this hypothesis, we apply the hweT and the proposed hweMLE procedures to the data from the Genetic Analysis Workshop 16 (GAW16)24. This data consists of 2,062 subjects. Based on the anti-CCP measure, the subjects can be divided into four categories: without RA, below 20; low or weak, 20–39; moderate, 40–59; high or strong, > 60. The number of subjects are 1,195, 103, 66 and 698 corresponding to the above four categories, respectively. There are 45 SNPs in the region of TRAF1-C5 on Chromosome 9. The snpids (SNP ID), positions, the point estimators and the p-values of the existing and the proposed procedures are summarized in Table 2. Before conducting the association analysis, we use the HWD coefficient25, denoted as D, to test whether the HWE law holds in the controls. When the HWE law holds in the controls, D = 0. The HWE test is given by  , where , nc0, nc1 and nc2 are the counts of the subjects possessing the genotype 0, 1 and 2, respectively and nc = nc0 + nc1 + nc2. Under that D = 0, the HWE test follows the standard normal distribution. The results in Table 2 show that the HWE law holds in the controls for these 45 SNPs under the significance level of 0.05. Then we apply the proposed hweT and proT to test for the associations between these 45 SNPs and the anti-CCP measure. We find that the significance of the association between these SNPs and the anti-CCP measure using the proposed hweT is always stronger than those using the proT. For example, we can identify three SNPs, rs1953126, rs881375 and rs3761847 with p-values less than 10−7 using the proT or the hweT. However, we can identify another five SNPs including rs10760130, rs10985073, rs2900180, rs7037673 and rs1468673 with p-values being less than 0.0001 using the hweT, while only three SNPs rs1953126, rs881375 and rs2900180 can be identified using the proT. In addition, we use the Fisher-combined method to combine the p-values over these 45 SNPs as . The combined values of Tcom for the proT and hweT are 408.9 and 512.2. Based on 1,000 bootstrap replicates, we calculate the p-values of Tcom for the proT and hweT. Both are less than 0.001. This indicates that the gene TRAF1-C5 is associated with the anti-CCP measure and that the hweT can detect association signals easily than the proT.

Table 2 The point estimates of β and p-values for 45 SNPs in region of TRAF1- C5 for the association with 4-level anti-CCP measure.

Discussion

When using the logistic regression to handle the binary response outcome in genetic association studies, it has been shown that the odds ratio estimate based on the MLE is equivalent to that from the same model by taking the data as being arisen prospectively4,5,6. However, this equivalence does not hold for the proportional odds model. Cosslett10 and Wild11 proposed to obtain a consistent estimator through optimizing a modified likelihood function. In this work, by incorporating HWE principal in the retrospective likelihood function, we extend Cosslett’s procedure and obtain a consistent and asymptotically unbiased estimator. Based on this estimator, we construct the score test statistic. Numerical results show that the MLE from the prospective proportional odds model is substantially biased and the proposed estimator is consistent and the proposed score test statistic is powerful than that constructed from a prospective likelihood function.

HWE principal is very important in genetic association studies. It is often considered to be a cornerstone for further statistical inference. Departure from HWE often result from inbreeding, population migration and genotyping errors. Researchers have suggested that the deviation of HWE among cases can provide additional evidence for the associations between genetic variants and human diseases17,19. As shown in the results, incorporating HWE into the proportional odds model can also improve the efficiency of the estimate of genetic effect and also improve the power to identify the deleterious genetic variants. We also explore the performance of the proposed procedure when the HWE is violated. The simulation results are available in the supplementary material, which indicate that the proposed procedures work well when the HWE is violated slightly. Actually, if the HWE law is violated, we can estimate the parameters of interest through assuming that the genotype frequencies in the control group satisfy Pr(G = 0|Y = 1) = p0, Pr(G = 1|Y = 1) = p1, Pr(G = 2|Y = 1) = p2. Thus there is one additional parameter that needs to be estimated in the proposed modified likelihood function. At this point, the number of parameters is larger than that under the assumption of HWE. Hence, the biases of the estimators tend to bigger than those under the assumption that the HWE law holds.

The sandwich variance estimate is a common tool used to estimate the variance of quasi-likelihood estimates from generalized estimating equations (GEE)26. However, Kauermann and Carroll27 proved that the sandwich variance estimator has the downward bias with O(n−1) order for the quasi-likelihood estimates from GEE, where n is the total sample size, because it is derived based on the first-order approximation of the Taylor expansion about the estimating equation. Thus in our case, if we use the sandwich variance estimator to construct the test statistic, this may result in inflated type I error rate. Hence, we use the summation of the first derivatives of the likelihood function on the individual observation to estimate the variance of the MLE. It should be noted that the used variance estimate is a consistent estimate based on the law of large numbers.

Methods

Notations

Consider a biallelic SNP and the genotype at a marker locus is coded as 0, 1 or 2, with the value corresponding to the copy number of a certain candidate allele. Let Y be a J ordered status response variable and G be a random variable taking the genotype values of the subjects at a SNP locus. Without loss of generality, let Y = 1 denote the status of a healthy individual and Y = j denote the status of a diseased subject, j = 2, 3,…, J. Then the standard proportional odds model9 is

where β is the parameter of interest, which is called log-odds ratio when J = 2 and θj, j = 1, 2,…, J − 1 are the intercepts. Denote ϕ(x) = 1/(1 + exp(−x)) for x. Using (1), we have

Let be the genotypes of the nj subjects who are randomly sampled from the jth subpopulation for j = 1, 2,…, J. Denote as the total sample size.

Consistent Estimate

If we take the data as being collected from a prospective study, the prospective likelihood function is

where θ = (θ1, θ2,…,θJ−1)τ and τ denotes the transpose of a vector or a matrix. The corresponding log-likelihood function is

As shown in Cosslett10, using the above model to analyze the retrospective data directly often leads to a biased estimate of β when β ≠ 0 and the bias increases as β increases. So, Cosslett10 proposed to optimize the following modified log-likelihood function to get the estimate of β:

where Δ = (Δ2, Δ3,…, ΔJ)τ and Δ1 = n1/n.

Based on the reciprocal of case-control design where all case groups are randomly sampled from the case population, the structure among different case groups in the sample is the same as that in the general case population. So each case group should have the same degree of importance which yields Δ2 = Δ3 = … = ΔJ. Taking this equality into consideration, the score test statistic cannot be constructed using lm(β, θ, Δ) because the observed Fisher information matrix of (β, θτ, Δτ)τ is close to be singular. So, in the following part, we will derive a MLE through incorporating this equality and the HWE law. Suppose that the HWE principle holds in the control population with the minor allele frequency p. Thus Pr(G = 0|Y = 1) = (1 − p)2, Pr(G = 1|Y = 1) = 2p(1 − p), Pr(G = 2|Y = 1) = p2. From the Supplemental Material, we set , , where , w = [(1 − p)2/ϕ(θ1) + 2p(1 − p)/ϕ(θ1 − β) + p2/ϕ(θ1 − 2β)]. Then, the modified likelihood function is rewritten as

and the log-likelihood function is

where m0, m1 and m2 are the numbers of the subjects with genotypes 0, 1 and 2, respectively, in the sample. We adopt two steps to estimate the parameters. We first estimate the parameter p using the observations in controls and denote the estimator by . Based on the law of large numbers, we know that converges to p almost surely. Then we optimize lh(β, θ, p) according to β and θ under to obtain the estimate of β and θ through. Denote the estimator of (β, θτ)τ by . Then from the theorem in the Supplemental Material, is consistent to the true value of β and asymptotically follows a standard normal distribution, where , is the (1, 1)th element of the matrix I−1(β, θ),

and lh,ij = ln(Δj) + ln[ϕ(θj − gijβ) − ϕ(θj−1 − gijβ)] − ln{Δ1ϕ(θ1 − gijβ) + Δ2[1 − ϕ(θ1 − gijβ)]} for i = 1, 2,…, nj and j = 1, 2,…, J.

Test Statistic

In genetic association studies, the most concern of investigators is whether the genetic variant is associated with the disease. One can construct the Wald test statistic based on the asymptotic normality of . Another commonly employed test statistic is the score test statistic. Denote Aβ = n1Aβ1/Aβ2, where , and the MLE of θ under β = 0 by . Then, the score function is

and the score test statistic (denote it by hweT) is

where is defined as above. Under the null hypothesis, hweT asymptotically follows the standard normal distribution.

Additional Information

How to cite this article: Zhang, W. et al. Fitting Proportional Odds Model to Case-Control data with Incorporating Hardy-Weinberg Equilibrium. Sci. Rep. 5, 17286; doi: 10.1038/srep17286 (2015).