Abstract
It is now widely recognized that haplotype information inferred from genotypes can help to better characterize the role of a candidate gene in the etiology of a complex trait in association studies. Several works have recently advocated the simultaneous estimation of haplotype frequencies and haplotype effects to improve the efficiency of parameter estimation. Most of the available models can deal with a binary or a quantitative phenotype, but none has yet discussed the application of haplotype-based association analysis to a survival outcome. We describe how the recently proposed Stochastic-EM (SEM) algorithm can be applied to estimate haplotype effects in censored data analysis using a standard Cox proportional hazards formulation. This model has been implemented in the THESIAS software, freely available at http://www.genecanvas.org.
Introduction
It is now widely recognized that haplotype information inferred from genotypes can be of great interest to better characterize the role of a candidate gene in the etiology of a complex trait [1–4]. Haplotype-based analysis may help in differentiating the true effect of a polymorphism from what is due to its linkage disequilibrium with other variant(s). Haplotypes may serve as better markers for unknown functional variants than single polymorphisms. Lastly, they may define functional units whose effects cannot be predicted from what is known of the individual effect of each variant. This explains the large amount of work that has been devoted to the development of statistical tools for haplotype inference [4–14]. It is now widely accepted that haplotype frequencies and haplotype effects have to be estimated simultaneously to improve the efficiency of parameter estimation [4, 7, 8, 11–14]. To our knowledge, available models allowing this joint estimation can deal with a binary and/or a quantitative phenotype, but none has yet discussed the application of haplotype-based analysis to a survival outcome. The objective of this work is to describe how our recently proposed Stochastic-EM (SEM) algorithm [13] can be extended to a haplotype-based analysis of censored data using a standard Cox proportional hazards formulation [15].
System and methods
Consider a sample of N unrelated individuals and let (T̃i, Di, Gi) denote the ith individual's triplet, where T̃i=Ti∧Ci is the observed time (Ti being his/her failure time and Ci his/her censoring time), Di=I(Ti≤Ci) is the event indicator and Gi is his/her genotypic vector at k different loci. For ease of presentation, only the case of di-allelic polymorphisms is addressed here and we assume that Gi does not include any missing genotype, although both assumptions can easily be relaxed [13]. The number of possible haplotypic pairs compatible with Gi is 2^(ci−1), where ci is the number of loci at which the ith individual is heterozygous (a single pair when ci=0). Except when ci≤1, the true haplotypic pair of the ith individual cannot be unambiguously deduced from Gi. If the haplotypic pair Hi=(hi1, hi2) of the ith individual were observed, the contribution of this individual to the likelihood of the sample under the standard Cox formulation would be

$$L(\tilde{T}_i, D_i \mid H_i) = \left[\lambda_0(\tilde{T}_i)\, e^{\beta_{i1}+\beta_{i2}}\right]^{D_i} S(\tilde{T}_i \mid H_i)$$
where λ0(t) is an unspecified baseline hazard function and S(T̃i | Hi) is the survival function at time T̃i. In this model, exp(βi1) and exp(βi2) represent the hazard risk ratios (HRRs) for the survival outcome associated with haplotypes hi1 and hi2, respectively, by comparison with a reference haplotype (which can be taken as the most frequent haplotype, for example), under the assumption of additive haplotype effects. The survival function S(T̃i | Hi) is defined by

$$S(\tilde{T}_i \mid H_i) = \exp\left\{-\Lambda(\tilde{T}_i)\, e^{\beta_{i1}+\beta_{i2}}\right\}$$
where Λ(T̃i) is the baseline cumulative hazard function at time T̃i, whose estimation is detailed below.
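To make the phase ambiguity concrete, the following minimal sketch enumerates the haplotypic pairs compatible with an unphased di-allelic genotype. The genotype coding (0, 1 or 2 copies of an arbitrarily chosen allele at each locus) and the function name `compatible_pairs` are assumptions made for illustration; they are not taken from THESIAS.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return the unordered haplotype pairs compatible with a genotype.

    genotype: sequence of 0/1/2, the number of copies of allele "1" at each
    locus (hypothetical coding chosen for this sketch).
    Haplotypes are returned as tuples of 0/1 alleles.
    """
    het = [k for k, g in enumerate(genotype) if g == 1]
    # Allele carried on both haplotypes at homozygous loci (placeholder 0 at
    # heterozygous loci, overwritten below).
    base = [g // 2 for g in genotype]
    pairs = set()
    for phase in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for k, allele in zip(het, phase):
            h1[k], h2[k] = allele, 1 - allele
        # Sort within the pair so that (h1, h2) and (h2, h1) count once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


if __name__ == "__main__":
    for pair in compatible_pairs([1, 1, 2]):
        print(pair)
```

With c heterozygous loci the function returns 2^(c−1) pairs when c≥1 and a single pair when c=0, so only individuals with at least two heterozygous loci are ambiguous.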
Algorithm
The SEM algorithm, whose general description for haplotype-based association analysis has been given previously [13], is an iterative algorithm where, at each iteration, any ambiguous haplotypic pair, considered as missing data, is replaced by a simulated value drawn from its conditional distribution given the observed data and the parameters obtained from the previous iteration.
The vector of parameters to be estimated, θ, is composed of the haplotype frequencies f(hl) (l=1…s, with s≤2^k) and the logarithms of the haplotypic HRRs, βl (l=1…s). The (m+1)th iteration consists of two steps, the stochastic imputation step and the maximization step, which take the following forms in the context of a Cox survival haplotype analysis.
The Stochastic-Imputation step
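The unobserved haplotypic pair of an ambiguous individual i is set to a single draw from the distribution of haplotypic pairs H specified by P(H | T̃i, Di, Gi), evaluated at θ(m), the current vector of estimated parameters at the mth iteration, and defined by:

$$P(H_j \mid \tilde{T}_i, D_i, G_i) = \frac{L(\tilde{T}_i, D_i \mid H_j)\, P(H_j)}{\sum_{H_l \in S(G_i)} L(\tilde{T}_i, D_i \mid H_l)\, P(H_l)}$$

where L(T̃i, Di | Hj) is the likelihood contribution of individual i under the Cox formulation given the haplotypic pair Hj, S(Gi) is the set of all haplotypic pairs Hj=(hj1, hj2) compatible with Gi, and P(Hj) is a function of the current estimated haplotype frequencies f(m)(hl).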
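The imputation step can be sketched as follows. This is an illustrative sketch, not the THESIAS implementation: it assumes that P(Hj) follows Hardy–Weinberg proportions and that the Cox contribution of a pair is proportional to exp(D(β1+β2)) exp(−Λ exp(β1+β2)), the baseline hazard factor being common to all candidate pairs and therefore cancelling in the normalization. All function and argument names are hypothetical.

```python
import math
import random


def pair_prior(h1, h2, freq):
    """Hardy-Weinberg probability of an unordered haplotype pair (assumption)."""
    p = freq[h1] * freq[h2]
    return p if h1 == h2 else 2.0 * p


def imputation_probs(pairs, freq, beta, cumhaz, event):
    """Conditional probabilities of the candidate pairs for one individual.

    pairs  : haplotype pairs compatible with the individual's genotype
    freq   : current haplotype frequencies, keyed by haplotype
    beta   : current log hazard ratios, keyed by haplotype (0 for reference)
    cumhaz : baseline cumulative hazard at the individual's observed time
    event  : 1 if the failure was observed, 0 if censored
    """
    weights = []
    for h1, h2 in pairs:
        score = beta[h1] + beta[h2]
        # Cox contribution up to the baseline hazard factor, which cancels.
        cox = math.exp(event * score) * math.exp(-cumhaz * math.exp(score))
        weights.append(pair_prior(h1, h2, freq) * cox)
    total = sum(weights)
    return [w / total for w in weights]


def draw_pair(pairs, probs, rng=random):
    """Single stochastic draw of a haplotype pair."""
    return rng.choices(pairs, weights=probs, k=1)[0]


if __name__ == "__main__":
    pairs = [("00", "11"), ("01", "10")]
    freq = {"00": 0.4, "11": 0.3, "01": 0.2, "10": 0.1}
    beta = {"00": 0.0, "11": 0.0, "01": 0.5, "10": 0.0}
    probs = imputation_probs(pairs, freq, beta, cumhaz=0.3, event=1)
    print(probs, draw_pair(pairs, probs, random.Random(1)))
```

When all β are zero the survival data carry no information on phase and the draw reduces to sampling in proportion to the prior pair probabilities.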
The Maximization step
With the pseudo-completed sample, a likelihood maximization routine is then used to obtain updated parameters θ(m+1). This can be decomposed into two parts. First, haplotype frequencies are obtained by counting the pseudo-observed haplotypic pairs Hi=(hi1, hi2) under the assumption of Hardy–Weinberg equilibrium (HWE). Then, the logarithms of the haplotypic HRRs are independently updated by the standard maximum likelihood (ML) estimates obtained from the partial Cox likelihood computed on the pseudo-completed data, where the haplotypic pair of any individual is now considered to be observed, that is, by maximizing the following likelihood:

$$PL(\beta) = \prod_{i:\, D_i = 1} \frac{e^{\beta_{i1}+\beta_{i2}}}{\sum_{j:\, \tilde{T}_j \ge \tilde{T}_i} e^{\beta_{j1}+\beta_{j2}}}$$
Given the updated β(m+1), the cumulative hazard function is then updated according to the Breslow estimator [16] used in the context of Cox proportional hazards analysis.
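As an illustration of the two survival quantities used in the maximization step, the sketch below evaluates the Cox log partial likelihood and the Breslow estimate of the baseline cumulative hazard on pseudo-completed data. It is a simplified sketch under stated assumptions: `scores[i]` stands for βi1+βi2 of individual i, tied event times are handled with the Breslow approximation, and the maximization over β itself (for example by Newton–Raphson) is not shown. Function names are hypothetical.

```python
import math


def log_partial_likelihood(times, events, scores):
    """Cox log partial likelihood (Breslow approximation for tied times).

    times  : observed times (failure or censoring)
    events : 1 if failure observed, 0 if censored
    scores : linear predictors, here beta_i1 + beta_i2 for each individual
    """
    ll = 0.0
    for t_i, d_i, s_i in zip(times, events, scores):
        if d_i:
            # Sum of exp(score) over individuals still at risk at t_i.
            risk_sum = sum(math.exp(s) for t, s in zip(times, scores) if t >= t_i)
            ll += s_i - math.log(risk_sum)
    return ll


def breslow_cumulative_hazard(times, events, scores):
    """Return [(event_time, cumulative baseline hazard)] in increasing time."""
    out, cum = [], 0.0
    event_times = sorted({t for t, d in zip(times, events) if d})
    for t_i in event_times:
        n_events = sum(1 for t, d in zip(times, events) if d and t == t_i)
        risk_sum = sum(math.exp(s) for t, s in zip(times, scores) if t >= t_i)
        cum += n_events / risk_sum
        out.append((t_i, cum))
    return out


if __name__ == "__main__":
    times, events, scores = [1.0, 2.0, 3.0], [1, 0, 1], [0.0, 0.0, 0.0]
    print(log_partial_likelihood(times, events, scores))
    print(breslow_cumulative_hazard(times, events, scores))
```

Both functions are quadratic in the sample size as written; sorting the data once and accumulating risk sets backwards would be the usual optimization.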
To initialize the algorithm, a starting value θ(0) must be provided. For example, all βl's can be set to 0 and haplotype frequencies can be calculated assuming that all polymorphisms are in linkage equilibrium, that is, as the product of the corresponding allele frequencies.
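A minimal sketch of this initialization, assuming genotypes are coded as the number of copies (0, 1 or 2) of an arbitrarily chosen allele at each locus; the coding and function names are hypothetical.

```python
from itertools import product


def allele_frequencies(genotypes):
    """Frequency of allele "1" at each locus from 0/1/2 genotype counts."""
    n = len(genotypes)
    k = len(genotypes[0])
    return [sum(g[j] for g in genotypes) / (2.0 * n) for j in range(k)]


def equilibrium_haplotype_frequencies(allele_freqs):
    """Starting haplotype frequencies under linkage equilibrium.

    Each haplotype frequency is the product, over loci, of the frequency of
    the allele it carries at that locus.
    """
    freqs = {}
    for hap in product((0, 1), repeat=len(allele_freqs)):
        p = 1.0
        for allele, q in zip(hap, allele_freqs):
            p *= q if allele == 1 else 1.0 - q
        freqs[hap] = p
    return freqs


if __name__ == "__main__":
    genotypes = [(0, 1), (2, 1), (1, 0), (1, 2)]
    start = equilibrium_haplotype_frequencies(allele_frequencies(genotypes))
    for hap, f in sorted(start.items()):
        print(hap, round(f, 4))
```

The starting log hazard ratios would simply be a dictionary of zeros over the same haplotype keys.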
Let M be the total number of iterations of the SEM algorithm. The properties of the generated sequence {θ(m)}, m=1…M, are detailed elsewhere [17, 18]. The main result is that the sequence {θ(m)} does not converge pointwise but forms a Markov chain that rapidly converges, under regularity conditions, to a stationary distribution [18]. Stationarity is reached after a sufficiently long ‘burn-in’ period and the point estimate θ̃ is then simply the mean of the θ(m) drawn within this stationary distribution. The resulting SEM estimate θ̃ has been shown to be asymptotically equivalent to the ML estimate θ̂ in the exponential family case [17], and this equivalence has been observed in many other situations.
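The point estimate described above amounts to averaging the parameter vectors retained after the burn-in period. A minimal sketch, in which the length of the burn-in is a user-supplied assumption (no automatic stationarity diagnostic is attempted) and the function name is hypothetical:

```python
def sem_point_estimate(chain, burn_in):
    """Mean of the parameter vectors theta(m) after discarding the burn-in.

    chain   : list of parameter vectors, one per SEM iteration
    burn_in : number of initial iterations to discard (user-chosen)
    """
    kept = chain[burn_in:]
    if not kept:
        raise ValueError("burn_in must be smaller than the chain length")
    n_params = len(kept[0])
    return [sum(theta[j] for theta in kept) / len(kept) for j in range(n_params)]


if __name__ == "__main__":
    chain = [[10.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
    print(sem_point_estimate(chain, burn_in=1))
```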
Once the SEM estimate θ̃ is obtained, we propose as parameter variance estimates those obtained by inverting the Fisher information matrix derived from the following likelihood expression evaluated at θ̃:
Finally, evaluating (3) at θ̃ provides an estimation of the partial Cox likelihood of the sample that can then be used for hypothesis testing by means of the likelihood ratio test statistics.
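For illustration, a likelihood ratio test between two nested models can be computed from their maximized log-likelihoods as sketched below. The sketch assumes the usual asymptotic chi-square reference distribution with degrees of freedom equal to the number of constrained parameters; the function names are hypothetical and the chi-square tail probability is computed with the standard library only.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 (a = 1/2) and df = 2 (a = 1).
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    for _ in range((df - 1) // 2):
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q


def likelihood_ratio_test(loglik_null, loglik_full, df):
    """Return the LRT statistic and its asymptotic p-value."""
    stat = 2.0 * (loglik_full - loglik_null)
    return stat, chi2_sf(stat, df)


if __name__ == "__main__":
    stat, p_value = likelihood_ratio_test(-100.0, -98.0, df=1)
    print(stat, p_value)
```

Testing, for instance, whether a given haplotype has no effect on survival would compare the model with its β constrained to 0 against the unconstrained model, with one degree of freedom.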
Discussion
In this report, we proposed a flexible model allowing the joint estimation of haplotype frequencies and haplotype effects in the context of survival analysis. This model is based on the Cox formulation [15], which is considered a standard in proportional hazards analysis. The estimates provided by the proposed SEM algorithm are expected to be close to the ML estimates, even though the theoretical equivalence between the SEM and ML estimates has not been fully demonstrated in the case of the partial Cox likelihood. On two real data sets [19, 20], we compared the results provided by the proposed SEM algorithm with those obtained by a standard ML method for survival data analysis. However, since the implementation of a partial Cox likelihood with missing data (ie, ambiguous haplotypes) is not easily tractable and can be computationally cumbersome with the standard Newton–Raphson (NR) algorithm [13], we implemented a parametric Weibull model [21] in our previous NR-based method for haplotype-association analysis [4, 22], and we compared the estimates obtained by the two methods. Results of these comparisons are available online (http://www.genecanvas.org). Even though the Cox and Weibull models differ in their mathematical formulations and assumptions, they have been shown to produce similar results in many situations, and the similarity between the parameter estimates provided here by both methods strengthened our confidence in the validity of the SEM algorithm. The limitations of the current model are the assumption of HWE at the haplotypic level and that of proportional hazards. Note, however, that the assumption of HWE is more reasonable here, in the whole population of a cohort, than in a case–control design. It would also be interesting to develop a statistical tool to assess the goodness-of-fit of the Cox proportional hazards assumption in the framework of a haplotype-based association analysis.
While this manuscript was under review, a similar approach based on the EM algorithm was proposed [23]. Even though the SEM and EM algorithms are expected to be asymptotically equivalent, it would be interesting to compare them in situations where asymptotic properties may not be valid, in particular in the case of rare haplotypes. Ambiguous haplotypes can be considered as variables observed with measurement error that would be a function of the LD pattern between the studied polymorphisms. Application of statistical methods dealing with errors in variables in Cox regression analysis [24–26] may then be envisaged in the context of haplotype analysis and would deserve further attention.
This model has been implemented in the THESIAS program, which can also deal with a quantitative or a binary phenotype, under both a standard and a matched case–control design (the latter using a partial likelihood similar to that described above). Our model is general enough to incorporate information on additional covariates and to test for deviation from the hypothesis of additivity of the haplotypic effects. THESIAS is written in ANSI C and is available free of charge from http://www.genecanvas.org. THESIAS has already been used by different groups for real data analysis with a binary, a quantitative or a survival outcome, and appears to be a useful tool for haplotype-based association studies.
References
1. Drysdale CM, McGraw DW, Stack CB et al: Complex promoter and coding region beta 2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc Natl Acad Sci USA 2000; 97: 10483–10488.
2. Klerkx AH, Tanck MW, Kastelein JJ et al: Haplotype analysis of the CETP gene: not TaqIB, but the closely linked −629C → A polymorphism and a novel promoter variant are independently associated with CETP concentration. Hum Mol Genet 2003; 12: 111–123.
3. Soubrier F, Martin S, Alonso A et al: High-resolution genetic mapping of the ACE-linked QTL influencing circulating ACE activity. Eur J Hum Genet 2002; 10: 553–561.
4. Tregouet DA, Barbaux S, Escolano S et al: Specific haplotypes of the P-selectin gene are associated with myocardial infarction. Hum Mol Genet 2002; 11: 2015–2023.
5. Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 2001; 68: 978–989.
6. Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG: Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered 2002; 53: 79–91.
7. Epstein MP, Satten GA: Inference on haplotype effects in case-control studies using unphased genotype data. Am J Hum Genet 2003; 73: 1316–1329.
8. Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 2002; 70: 157–169.
9. Zhao LP, Li SS, Khalid N: A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case–control studies. Am J Hum Genet 2003; 72: 1231–1250.
10. Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 2002; 70: 425–434.
11. Tanck MW, Klerkx AH, Jukema JW, De Knijff P, Kastelein JJ, Zwinderman AH: Estimation of multilocus haplotype effects using weighted penalised log-likelihood: analysis of five sequence variations at the cholesteryl ester transfer protein gene locus. Ann Hum Genet 2003; 67: 175–184.
12. Lake SL, Lyon H, Tantisira K et al: Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum Hered 2003; 55: 56–65.
13. Tregouet DA, Escolano S, Tiret L, Mallet A, Golmard JL: A new maximum likelihood algorithm for haplotype-based association analysis: the SEM algorithm. Ann Hum Genet 2004; 68: 165–177.
14. Stram DO, Pearce CL, Bretsky P et al: Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case–control study of unrelated individuals. Hum Hered 2003; 55: 179–190.
15. Cox DR: Regression models and life-tables (with discussion). J Roy Statist Soc B 1972; 34: 187–220.
16. Breslow NE: Contribution to the discussion on the paper by DR Cox, Regression models and life tables. J Roy Statist Soc B 1972; 34: 216–217.
17. Diebolt J, Ip EHS: Stochastic EM: method and application; in Gilks WR, Richardson S, Spiegelhalter DJ (eds): Markov Chain Monte Carlo in practice. London: Chapman & Hall, 1996, pp 259–273.
18. Diebolt J, Celeux G: Asymptotic properties of a stochastic EM algorithm for estimating mixing proportions. Comm Statist B: Stoch Model 1993; 9: 599–613.
19. Tregouet DA, Barbaux S, Poirier O et al: SELPLG gene polymorphisms in relation to plasma SELPLG levels and coronary artery disease. Ann Hum Genet 2003; 67: 504–511.
20. Ninio E, Tregouet D, Carrier JL et al: Platelet-activating factor-acetylhydrolase (PAF-AH) and PAF-receptor gene haplotypes in relation to future cardiovascular event in patients with coronary artery disease. Hum Mol Genet 2004, doi:10.1093/hmg/ddh145.
21. Cox D, Oakes D: Analysis of Survival Data. London, UK: Chapman & Hall, 1984.
22. Roussel R, Tregouet D, Hadjadj S, Jeunemaitre J, Marre M: Investigation of the human ANP gene in type 1 diabetic nephropathy: case–control and follow-up studies. Diabetes 2004; 53: 1394–1398.
23. Lin DY: Haplotype-based association analysis in cohort studies of unrelated individuals. Genet Epidemiol 2004; 26: 255–264.
24. Spiegelman D, McDermott A, Rosner B: Regression calibration method for correcting measurement-error bias in nutritional epidemiology. Am J Clin Nutrition 1997; 65: 1179S–1186S.
25. Wang CY, Hsu L, Feng ZD, Prentice RL: Regression calibration in failure time regression. Biometrics 1997; 53: 131–145.
26. Nakamura T: Proportional hazards model with covariates subject to measurement error. Biometrics 1992; 48: 829–838.
Acknowledgements
We wish to thank JL Golmard for his helpful comments on an earlier draft of this article and the AtheroGene Group for kindly providing us with the data used for illustration.
Cite this article
Tregouet, DA., Tiret, L. Cox proportional hazards survival regression in haplotype-based association analysis using the Stochastic-EM algorithm. Eur J Hum Genet 12, 971–974 (2004). https://doi.org/10.1038/sj.ejhg.5201238
DOI: https://doi.org/10.1038/sj.ejhg.5201238