Introduction

It is now widely recognized that haplotype information inferred from genotypes can be of great value for characterizing the role of a candidate gene in the etiology of a complex trait.1, 2, 3, 4 Haplotype-based analysis may help in differentiating the true effect of a polymorphism from that due to its linkage disequilibrium (LD) with other variant(s). Haplotypes may serve as better markers for unknown functional variants than single polymorphisms. Lastly, they may define functional units whose effects cannot be predicted from what is known of the individual effect of each variant. This explains the large amount of work that has been devoted to the development of statistical tools for haplotype inference.4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 It is now widely admitted that haplotype frequencies and haplotype effects have to be estimated simultaneously in order to improve the efficiency of parameter estimation.4, 7, 8, 11, 12, 13, 14 To our knowledge, available models allowing this joint estimation can deal with a binary and/or a quantitative phenotype, but none has yet addressed the application of haplotype-based analysis to a survival outcome. The objective of this work is to describe how our recently proposed Stochastic-EM (SEM) algorithm13 can be extended to a haplotype-based analysis of censored data using a standard Cox proportional hazards formulation.15

System and methods

Consider a sample of N unrelated individuals and let (T̃i, Di, Gi) denote the ith individual's triplet, where T̃i=Ti∧Ci with Ti being his/her failure time and Ci his/her censoring time, Di=I(Ti≤Ci), and Gi his/her genotypic vector at k different loci. For ease of presentation, only the case of di-allelic polymorphisms will be addressed here and we assume that Gi does not include any missing genotype, although both assumptions can easily be relaxed.13 The number of possible haplotypic pairs compatible with Gi is 2^(ci−1), where ci is the number of loci at which the ith individual is heterozygous. Except when ci≤1, the true haplotypic pair of the ith individual cannot be unambiguously deduced from Gi. If the haplotypic pair Hi=(hi1, hi2) of the ith individual were observed, the contribution of this individual to the likelihood of the sample under the standard Cox formulation would be

Li = [λ0(T̃i) e^(βi1+βi2)]^Di S(T̃i)  (1)

where λ0(t) is an unspecified baseline hazard function and S(T̃i) is the survival function at time T̃i. In this modeling, e^βi1 and e^βi2 represent the hazard risk ratios (HRRs) for the survival outcome associated with haplotypes hi1 and hi2, respectively, by comparison with a reference haplotype (which can be taken as the most frequent haplotype, for example), under the assumption of additive haplotype effects. S(T̃i) is defined by

S(T̃i) = exp[−Λ(T̃i) e^(βi1+βi2)]  (2)

where Λ(T̃i) is the cumulative hazard function at time T̃i, whose estimation will be detailed below.
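
For illustration, the set of haplotypic pairs compatible with Gi can be enumerated as follows (a minimal Python sketch, assuming genotypes coded as 0/1/2 counts of the minor allele; the function name and coding convention are ours, not part of THESIAS):

```python
from itertools import product

def compatible_pairs(genotype):
    """Enumerate the haplotype pairs compatible with a di-allelic
    genotype vector (coded 0/1/2 = copies of the minor allele).
    With c heterozygous loci there are 2**(c - 1) distinct pairs
    (a single, unambiguous pair when c <= 1)."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]      # homozygous loci: 0 -> 0, 2 -> 1
    pairs = set()
    for assignment in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for locus, allele in zip(het, assignment):
            h1[locus] = allele             # phase the heterozygous loci
            h2[locus] = 1 - allele
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

# Two heterozygous loci -> 2**(2 - 1) = 2 compatible pairs:
# compatible_pairs([1, 1, 0]) == [((0, 0, 0), (1, 1, 0)), ((0, 1, 0), (1, 0, 0))]
```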

Algorithm

The SEM algorithm, whose general description for haplotype-based association analysis has been given previously,13 is an iterative algorithm in which, at each iteration, any ambiguous haplotypic pair, considered as missing data, is replaced by a simulated value drawn from its conditional distribution given the observed data and the parameters obtained at the previous iteration.

The vector of parameters to be estimated, θ, is composed of the haplotype frequencies f(hl) (l=1…s; s≤2^k) and the logarithms of the haplotypic HRRs, βl (l=1…s). The (m+1)th iteration consists of two steps, the stochastic imputation step and the maximization step, which take the following forms in the context of a Cox survival haplotype analysis.

The Stochastic-Imputation step

The unobserved haplotypic pair of an ambiguous individual i is set to a single draw from the distribution of haplotypic pairs H specified by P(H | T̃i, Di, Gi), evaluated at θ(m), the current vector of parameter estimates at the mth iteration, and defined by:

P(Hj | T̃i, Di, Gi) = P(T̃i, Di | Hj) P(Hj) / Σ_{Hl∈S(Gi)} P(T̃i, Di | Hl) P(Hl)

where S(Gi) is the set of all haplotypic pairs Hj such that Hj=(hj1, hj2) is compatible with Gi, where P(T̃i, Di | Hj) is the complete-data contribution (1) evaluated at Hj, and where P(Hj) is a function of the current estimated haplotype frequencies f(m)(hl).
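
A minimal sketch of this step (in Python, with hypothetical names; note that the baseline hazard term λ0(T̃i)^Di in (1) is common to all candidate pairs and cancels out of the weights):

```python
import numpy as np

def draw_haplotype_pair(pairs, cum_hazard_i, event_i, beta, freq, rng):
    """Draw one haplotype pair for an ambiguous individual i from
    P(H | T~i, Di, Gi) at the current parameter values theta(m).
    pairs        : haplotype-index pairs (h1, h2) compatible with Gi
    cum_hazard_i : current estimate of Lambda(T~i)
    event_i      : Di (1 = observed failure, 0 = censored)
    beta         : current log-HRR per haplotype (reference = 0)
    freq         : current haplotype frequency estimates f(m)"""
    w = np.empty(len(pairs))
    for j, (h1, h2) in enumerate(pairs):
        eta = beta[h1] + beta[h2]
        surv = np.exp(-cum_hazard_i * np.exp(eta))                # S(T~i | Hj)
        prior = freq[h1] * freq[h2] * (2.0 if h1 != h2 else 1.0)  # P(Hj), HWE
        w[j] = np.exp(eta) ** event_i * surv * prior
    w /= w.sum()
    return pairs[rng.choice(len(pairs), p=w)]
```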

The Maximization step

With the pseudo-completed sample, a likelihood maximization routine is then used to obtain updated parameters θ(m+1). This can be decomposed into two parts. First, haplotype frequencies are obtained by counting the pseudo-observed haplotypic pairs Hi=(hi1, hi2) under the assumption of Hardy–Weinberg equilibrium (HWE). Then, the logarithms of the haplotypic HRRs are independently updated by the standard maximum likelihood (ML) estimates obtained from the partial Cox likelihood applied to the pseudo-completed data, where the haplotypic pair of every individual is now considered to be observed, that is, by maximizing the following likelihood:

L(β) = Π_{i: Di=1} [e^(βi1+βi2) / Σ_{j∈R(T̃i)} e^(βj1+βj2)]  (3)

where R(T̃i) denotes the set of individuals still at risk at time T̃i.
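The two parts of this step could be sketched as follows, with one additive 'dosage' covariate per non-reference haplotype and the partial-likelihood fit delegated here to the CoxPHFitter of the lifelines package (an illustrative sketch under these assumptions, not the THESIAS implementation; absent or very rare haplotypes would need special handling):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter   # any Cox partial-likelihood routine would do

def maximization_step(imputed_pairs, times, events, n_haplotypes, ref=0):
    """M-step on the pseudo-completed sample: (i) haplotype frequencies
    by simple counting under HWE, (ii) log-HRRs by maximizing the Cox
    partial likelihood (3) with haplotype dosages as covariates."""
    n = len(imputed_pairs)
    counts = np.zeros(n_haplotypes)
    dosage = np.zeros((n, n_haplotypes))
    for i, (h1, h2) in enumerate(imputed_pairs):
        counts[h1] += 1
        counts[h2] += 1
        dosage[i, h1] += 1
        dosage[i, h2] += 1
    freq = counts / (2 * n)                       # f(m+1)(h_l)
    nonref = [l for l in range(n_haplotypes) if l != ref]
    df = pd.DataFrame(dosage[:, nonref], columns=[f"h{l}" for l in nonref])
    df["T"], df["E"] = np.asarray(times), np.asarray(events)
    cph = CoxPHFitter().fit(df, duration_col="T", event_col="E")
    beta = np.zeros(n_haplotypes)                 # reference haplotype: beta = 0
    for l in nonref:
        beta[l] = cph.params_[f"h{l}"]            # params_ in recent lifelines
    return freq, beta
```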
Given the updated β(m+1), the cumulative hazard function is then updated according to the Breslow estimator16 used in the context of Cox proportional hazards analysis.
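
A sketch of this update, given the linear predictors ηj=βj1+βj2 of the pseudo-completed sample (ties between event times are handled naively here):

```python
import numpy as np

def breslow_cumulative_hazard(times, events, eta):
    """Breslow estimate of the cumulative hazard:
    Lambda(t) = sum over event times T~i <= t of
                1 / sum_{j in R(T~i)} exp(eta_j),
    where R(T~i) is the set of individuals still at risk at T~i.
    Returns Lambda(T~i) for every individual, in the input order."""
    times = np.asarray(times, dtype=float)
    order = np.argsort(times)
    d_sorted = np.asarray(events)[order]
    risk = np.exp(np.asarray(eta))[order]
    # sum of exp(eta) over the risk set {j : T~j >= T~i}, per sorted index
    risk_sums = np.cumsum(risk[::-1])[::-1]
    increments = np.where(d_sorted == 1, 1.0 / risk_sums, 0.0)
    cum_sorted = np.cumsum(increments)
    out = np.empty_like(cum_sorted)
    out[order] = cum_sorted        # back to the original ordering
    return out
```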

To initialize the algorithm, a starting value θ(0) must be provided. For example, all βl's can be set to 0 and haplotype frequencies can be calculated assuming that all polymorphisms are in linkage equilibrium, that is, as products of the corresponding allele frequencies.
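
With the 0/1/2 genotype coding used above, these linkage-equilibrium starting frequencies can be obtained as follows (a small illustrative sketch):

```python
import numpy as np
from itertools import product

def initial_frequencies(genotypes):
    """theta(0): haplotype frequencies under linkage equilibrium,
    i.e. products of the observed allele frequencies at each locus
    (all beta_l start at 0). `genotypes` is an N x k array of 0/1/2
    minor-allele counts; haplotypes are keyed by their allele tuple."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                   # minor-allele frequency per locus
    return {h: np.prod([p[j] if a else 1.0 - p[j] for j, a in enumerate(h)])
            for h in product((0, 1), repeat=g.shape[1])}
```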

Let M be the total number of iterations of the SEM algorithm. The properties of the generated sequence {θ(m)}, m=1…M, are detailed elsewhere.17, 18 The main result is that the sequence {θ(m)} does not converge pointwise but constitutes a Markov chain that, under regularity conditions, rapidly converges to a stationary distribution.18 Stationarity is reached after a sufficiently long ‘burn-in’ period, and the point estimate θ̃ is then simply the mean of the θ(m) within this stationary phase. The resulting SEM estimate θ̃ has been shown to be asymptotically equivalent to the ML estimate in the exponential family case,17 and this equivalence has been observed in many other situations.
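
Tying the sketches above together, an SEM run with burn-in and post-burn-in averaging could look like this (hypothetical helpers from the previous sketches; not the THESIAS source):

```python
import numpy as np

def sem_estimate(genotypes, times, events, n_burnin=100, n_iter=1000, seed=0):
    """Full SEM skeleton: theta(m) does not converge pointwise, so the
    point estimate theta~ is the average of the draws collected once
    the chain has passed the burn-in period."""
    k = len(genotypes[0])
    n_hap = 2 ** k
    idx = lambda h: sum(a << j for j, a in enumerate(h))   # allele tuple -> index
    candidates = [[(idx(h1), idx(h2)) for h1, h2 in compatible_pairs(g)]
                  for g in genotypes]
    freq = np.zeros(n_hap)                     # linkage-equilibrium start
    for h, f in initial_frequencies(genotypes).items():
        freq[idx(h)] = f
    beta = np.zeros(n_hap)                     # all beta_l = 0 at theta(0)
    cum_haz = np.zeros(len(times))
    rng = np.random.default_rng(seed)
    draws = []
    for m in range(n_iter):
        imputed = [draw_haplotype_pair(c, cum_haz[i], events[i], beta, freq, rng)
                   for i, c in enumerate(candidates)]                  # S-step
        freq, beta = maximization_step(imputed, times, events, n_hap)  # M-step
        eta = np.array([beta[h1] + beta[h2] for h1, h2 in imputed])
        cum_haz = breslow_cumulative_hazard(times, events, eta)
        if m >= n_burnin:                      # keep only the stationary phase
            draws.append(np.concatenate([freq, beta]))
    return np.mean(draws, axis=0)              # theta~ = (f(h_1..s), beta_1..s)
```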

Once the SEM estimate θ̃ is obtained, we propose as parameter variance estimates those obtained by inverting the Fisher information matrix derived from the following likelihood expression evaluated at θ̃:

L(θ) = Π_{i=1…N} Σ_{Hj∈S(Gi)} [λ0(T̃i) e^(βj1+βj2)]^Di exp[−Λ(T̃i) e^(βj1+βj2)] P(Hj)  (4)

where βj1 and βj2 are the log-HRRs associated with the haplotypes of the pair Hj=(hj1, hj2).
Finally, evaluating (3) at θ̃ provides an estimate of the partial Cox likelihood of the sample, which can then be used for hypothesis testing by means of the likelihood ratio test statistic.
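
When an analytical form of the information matrix is inconvenient, the same variance estimates can be approximated numerically; a generic sketch, where loglik stands for any routine evaluating the logarithm of (4) (in practice the frequency parameters would be reparameterized to respect their sum-to-one constraint):

```python
import numpy as np

def variance_estimates(loglik, theta_hat, eps=1e-4):
    """Invert the observed Fisher information at theta~, with the
    Hessian of the log-likelihood obtained by central differences."""
    p = len(theta_hat)
    hess = np.empty((p, p))
    t0 = np.asarray(theta_hat, dtype=float)
    for a in range(p):
        for b in range(p):
            def f(da, db):
                t = t0.copy()
                t[a] += da
                t[b] += db
                return loglik(t)
            hess[a, b] = (f(eps, eps) - f(eps, -eps)
                          - f(-eps, eps) + f(-eps, -eps)) / (4.0 * eps ** 2)
    info = -hess                                # observed information matrix
    return np.diag(np.linalg.inv(info))         # variance of each parameter
```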

Discussion

In this report, we proposed a flexible model allowing the joint estimation of haplotype frequencies and haplotype effects in the context of survival analysis. This model is based on the Cox formulation,15 which is considered the standard in proportional hazards analysis. The estimates provided by the proposed SEM algorithm are expected to be close to the ML estimates, even though the theoretical equivalence between the SEM and ML estimates has not been fully demonstrated in the case of the partial Cox likelihood. We compared, on two real data sets,19, 20 the results provided by the proposed SEM algorithm with those obtained by a standard ML method for survival data analysis. However, since the implementation of a partial Cox likelihood with missing data (i.e. ambiguous haplotypes) is not easily tractable and can be quite computationally cumbersome with the standard Newton–Raphson (NR) algorithm,13 we implemented a parametric Weibull model21 in our previous NR-based method for haplotype-association analysis,4, 22 and compared the estimates obtained by the two methods. Results of these comparisons are available online (http://www.genecanvas.org). Even though the Cox and Weibull models are quite different in terms of mathematical formulation and assumptions, they have been shown to produce similar results in many situations, and the similarity between the parameter estimates provided here by both methods strengthens our confidence in the validity of the SEM algorithm. The limitations of the current model are the assumption of HWE at the haplotypic level and that of proportional hazards. Note, however, that the assumption of HWE is more reasonable here, in the whole population of a cohort, than in a case–control design. It would also be interesting to develop a statistical tool to assess the goodness-of-fit of the Cox proportional hazards assumption within the framework of a haplotype-based association analysis.

While this manuscript was under review, a similar approach based on the EM algorithm was proposed.23 Even though the SEM and EM algorithms are expected to be asymptotically equivalent, it would be interesting to compare them in situations where asymptotic properties may not hold, in particular in the case of rare haplotypes. Ambiguous haplotypes can be viewed as variables observed with a measurement error that is a function of the LD pattern between the studied polymorphisms. Application of statistical methods dealing with errors in variables in Cox regression analysis24, 25, 26 may then be envisaged in the context of haplotype analysis and would deserve further attention.

This model has been implemented in the THESIAS program, which can also deal with a quantitative or a binary phenotype, under both standard and matched (using a partial likelihood similar to that described above) case–control designs. Our model is general enough to incorporate information on additional covariates and to test for deviation from the hypothesis of additivity of the haplotypic effects. THESIAS is written in ANSI C and is available free of charge from http://www.genecanvas.org. THESIAS has already been used by several groups for real data analysis, with binary, quantitative and survival outcomes, and has proved a very useful tool for haplotype-based association studies.