Introduction

Longevity studies show lifespan correlation between related individuals such as twins, sibs and family members. These correlations can be caused by common environmental and genetic factors. Some genes responsible for longevity have already been discovered, but they explain only a small part of genetic variation. Genes associated with longevity can be identified using the information on genotype frequencies for two or more age groups. A significant age trend of these frequencies can indicate a gene-longevity association. The ‘gene frequency’ method is based on this idea.1, 2 This basic method can be extended using demographic information about a studied population to allow the estimation of initial frequencies, relative risks and the age trajectories of mortality for candidate genes.3, 4

Bi- and multivariate correlated frailty models allow us to search for the association between genes and disease or mortality in the absence of observed covariates. The simplest way to find this association is to compare correlations for monozygotic and dizygotic twins.5 More complex models can involve the decomposition of the frailty into additive genetic factors, genetic dominance factors, shared environmental factors and non-shared environmental factors.6, 7

In addition to the frailty models with continuously distributed frailty, major gene models with discretely distributed frailty and mixed models with both discretely and continuously distributed frailty can also be considered.8 The models with discretely distributed frailty have an important advantage: they can be easily adapted to family and pedigree data.9

If, in addition to longevity or morbidity data, genetic marker data are also available, we can locate the position of the longevity or disease genes on the chromosome. Li and Zhong10 proposed a retrospective likelihood approach based on the allele-sharing test for genetic linkage analysis using sibship data. Jonker et al.11 used an extension of this method to test for linkage and heritability. A weighted nonparametric linkage statistic has been proposed by Callegaro et al.12 to test for linkage in selected samples. All these approaches use the correlated frailty model, where the frailty is broken down into the sum of the linkage effect and a shared residual effect. The components of the frailty must be calculated for each locus separately by maximizing the retrospective likelihood. This can lead to an enormous increase in computation time if several genetic markers are involved in these calculations.

If the markers are in linkage disequilibrium, we can use them as covariates in a Cox-like regression and estimate the respective regression coefficients. A significant deviation of some coefficients from zero can indicate that the respective markers are involved in survival or disease. Even if the genetic markers are in linkage equilibrium, we can determine the position of longevity or disease genes on the chromosome using linkage analysis.13 Under the assumption that the location of the longevity or frailty gene does not influence survival, this approach involves a two-step procedure. In the first step, we estimate the parameters of the underlying hazard functions and the parameters of the frailty distribution for the model with the major gene by maximizing the likelihood of the observed survival data. The second step focuses on determining the position of the longevity gene between the observed markers.

In this paper, we propose an extension of this two-step method: a bivariate correlated competing risks survival model with two major genes and genetic marker data. The properties of this model are illustrated using examples based on simulated data.

Materials and methods

Survival analysis

Suppose that Ki related individuals belong to cluster i, i=1,...,n, and individuals from different clusters are statistically independent. Subjects from the same cluster can belong to a sibship or to a family. For subject k from cluster i, we denote the time to first failure or the censoring time by tik and the vector of time-independent covariates by uik. Let lik∈{1,...,L} be the type of the first observed failure, and let lik=0 stand for right censoring. The event indicator is denoted by δik=I(lik≠0), where I(x)=1 if x=‘True’ and I(x)=0 otherwise. We define the cause-specific hazard function for subject k from cluster i by using the formula:

$$\lambda_{ikj}(t \mid Z_{ikj}, u_{ik}) = Z_{ikj}\,\lambda_{0j}(t)\,\exp(\beta_j' u_{ik}),$$

where j=1,...,L, Zikj is an individual frailty for the failure type j, λ0j is an underlying cause-specific hazard, βj are the cause-specific vectors of regression coefficients and the symbol ‘′’ stands for transposition.14

If only one of the failure types can occur, the full hazard function for the same subject is defined by using the formula:

$$\lambda_{ik}(t \mid Z_{ik}, u_{ik}) = \sum_{j=1}^{L} Z_{ikj}\,\lambda_{0j}(t)\,\exp(\beta_j' u_{ik}).$$

Suppose that the individual frailties Zikj can be correlated for subjects from the same cluster and for different types of failure. Dependency between subjects from a cluster can be caused by the correlated genotypes of relatives. Complex diseases can be influenced by many genes and environmental factors, and the same genes and factors can be involved in different diseases (onsets). This leads to correlations between types of failure (types of onsets or causes of death).
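As a minimal sketch of this hazard structure, the following Python fragment evaluates a cause-specific hazard and the full hazard for one subject. It assumes the usual multiplicative form (frailty × underlying hazard × exp(β′u)) and, purely for illustration, a Gompertz underlying hazard; all function and parameter names are hypothetical and not taken from the paper.

```python
import math

def gompertz_hazard(t, b, c):
    """Illustrative underlying hazard lambda_0j(t) = b * exp(c * t) (assumed form)."""
    return b * math.exp(c * t)

def cause_specific_hazard(t, z, b, c, beta, u):
    """Hazard for one failure type: frailty * underlying hazard * exp(beta'u)."""
    linear_predictor = sum(bi * ui for bi, ui in zip(beta, u))
    return z * gompertz_hazard(t, b, c) * math.exp(linear_predictor)

def full_hazard(t, frailties, params, betas, u):
    """Sum of the cause-specific hazards over all failure types."""
    return sum(
        cause_specific_hazard(t, z, b, c, beta, u)
        for z, (b, c), beta in zip(frailties, params, betas)
    )
```

For example, `full_hazard(70, [1.0, 0.1], [(2.5e-5, 0.1)] * 2, [[], []], [])` gives the total hazard at age 70 for a subject without covariates whose frailty for the second cause is reduced tenfold.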

Example (two competing risks of death influenced by two major genes)

Assume that two causes of death are influenced by two unobserved major genes, and that data on the age at death or the age at censoring are available for each of the twins in a sample consisting of n twin pairs. Let λ01(t) and λ02(t) be the underlying hazards for the first and the second cause of death, respectively. Assume that two longevity alleles with dominant action, a and b, have frequencies pa and pb, respectively, and that they are located at different loci on the same or on different chromosomes. The neutral alleles at these loci are denoted by A and B, with frequencies 1–pa and 1–pb, respectively. Suppose that the presence of at least one longevity allele a in the genotype decreases the risk of the type 1 failure by factor r1, r1<1, and the risk of the type 2 failure by factor q2, q2<1. Similarly, assume that the presence of at least one longevity allele b in the genotype decreases the risk of the type 2 failure by factor r2, r2<1, and the risk of the type 1 failure by factor q1, q1<1. Suppose that the absence of the longevity alleles corresponds to frailties Zik1=Zik2=1. If both longevity genes are in Hardy–Weinberg and linkage equilibrium and the action of the longevity genes does not depend on their location on the chromosome, then the possible longevity genotypes have the frequencies and frailties given in Table 1.

Table 1 Frailties and genotype frequencies. Both longevity genes are dominant

In this table, the notation (aa+aA) × BB, for example, means that a subject has one of the possible genotypes from the set (aa × BB, aA × BB). We assume that in the first locus of the genotype xX × yY, the allele x is inherited from the mother and the allele X from the father. Similarly, in the second locus of this genotype, the allele y is inherited from the mother and the allele Y from the father. Suppose that the parents are genetically independent and that their offspring inherit their genotypes independently. The frequencies of an offspring’s genotypes, depending on the parental genotypes, can be calculated under the assumption that an offspring receives with equal probability one of the two alleles from its mother’s genotype and likewise one of the two alleles from its father’s genotype (the law of segregation).15
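The genotype frequencies and segregation ratios described above can be sketched in a few lines of Python. This is an illustration under the stated assumptions (Hardy–Weinberg and linkage equilibrium, dominant action, equal-probability transmission of parental alleles), not a reproduction of Table 1; genotypes at one locus are coded by the number of longevity alleles (0, 1 or 2), and all function names are hypothetical.

```python
from itertools import product

def hwe_genotype_probs(p):
    """P(number of longevity alleles = 0, 1, 2) at one locus under Hardy-Weinberg."""
    return {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}

def transmit_prob(parent_count, allele):
    """Probability that a parent carrying `parent_count` longevity alleles
    transmits a longevity allele (allele=1) or a neutral one (allele=0)."""
    p_long = parent_count / 2
    return p_long if allele == 1 else 1 - p_long

def segregation(mother, father):
    """P(offspring allele count | parental allele counts) at one locus."""
    probs = {0: 0.0, 1: 0.0, 2: 0.0}
    for from_m, from_f in product((0, 1), repeat=2):
        probs[from_m + from_f] += (
            transmit_prob(mother, from_m) * transmit_prob(father, from_f)
        )
    return probs

def dominant_class_probs(pa, pb):
    """Frequencies of the four carrier classes for two loci in linkage equilibrium."""
    carrier_a = 1 - (1 - pa) ** 2  # at least one allele a
    carrier_b = 1 - (1 - pb) ** 2  # at least one allele b
    return {
        "AA x BB": (1 - carrier_a) * (1 - carrier_b),
        "(aa+aA) x BB": carrier_a * (1 - carrier_b),
        "AA x (bb+bB)": (1 - carrier_a) * carrier_b,
        "(aa+aA) x (bb+bB)": carrier_a * carrier_b,
    }
```

Coding genotypes by allele counts is sufficient here because, under dominant action, only the presence or absence of a longevity allele affects the frailty.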

We denote the observed data in cluster i, i=1, ..., n, by (Xi, Ui, Li, δi), where Xi=(Xik), k=1,...,Ki, is the time to the first failure or time to censoring vector for subjects in cluster i, Ui=(uik) is the set of vectors of observed covariates, Li=(lik) is the vector of the types of failure and δi=(δik) is the vector of event indicators. Let us assume that given the observed covariates U and frailties Zikj, the censoring times are independent of the failure times and do not correlate with frailties, frailties are independent of covariates, and covariates’ effect is subject specific.14 If the underlying cause-specific hazard functions are known up to the vector parameter ω, we can write the likelihood function in the form of the following:

Here H0j, depending on ω, are the cause-specific cumulative hazard functions, j=1, ..., L, and EZ is the expectation with respect to the frailty Z. The unknown vector parameter ζ=(pa, pb, r1, q1, r2, q2) characterizes the frailty distribution. When the clusters are dizygotic twin pairs and longevity is regulated by two dominant genes, we can rewrite the last formula as follows:

Here, Gm and Gf are the maternal and paternal genotypes, respectively. The frequencies of these genotypes, P(Gm|ζ) and P(Gf|ζ), and the twin frailties Zi1j(G) and Zi2j(G), j=1, 2, are given in Table 1. The segregation ratios of the mating types P(G|Gm,Gf) (the proportions of the different genotypes in the offspring of all mating types) can be calculated using the law of segregation. In this formula, the second and third lines and the fourth and fifth lines stand for the expected survivals and probability densities, given the parental genotypes, for the first and the second twin, respectively. We assumed here that both twins inherited their genes from their parents independently. In the first line, we take the average over all possible parental genotypes.

Monozygotic twins have identical genotypes, and the likelihood function has the form

The full likelihood is as follows:

We can directly find the estimates of the unknown vector parameters (β, ω, ζ) by maximizing the likelihood function L(X, L|U, δ; β, ω, ζ) with respect to (β, ω, ζ). Using the likelihood ratio, we can test different hypotheses about the parameters of the model.
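The structure of the twin-pair likelihood contributions can be illustrated with the following Python sketch. It is a simplified stand-in rather than the authors' implementation, and it relies on several assumptions that are not stated in this form in the text: Gompertz underlying hazards b·exp(c·t), no covariates, independently segregating (unlinked) loci, and multiplicative frailties for carriers of both longevity alleles. Genotypes are coded as pairs of allele counts, cause 0 denotes censoring, and all names are hypothetical.

```python
import math
from itertools import product

GENOS = [(na, nb) for na in (0, 1, 2) for nb in (0, 1, 2)]  # allele counts at two loci

def hwe(p):
    """Hardy-Weinberg probabilities of 0, 1, 2 longevity alleles at one locus."""
    return {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}

def seg(m, f):
    """P(child allele count | parental allele counts) at one locus."""
    out = {0: 0.0, 1: 0.0, 2: 0.0}
    for x, y in product((0, 1), repeat=2):
        px = m / 2 if x else 1 - m / 2
        py = f / 2 if y else 1 - f / 2
        out[x + y] += px * py
    return out

def frailties(g, r1, q1, r2, q2):
    """Cause-specific frailties; effects of a and b are assumed to multiply."""
    a, b = g[0] > 0, g[1] > 0
    z1 = (r1 if a else 1.0) * (q1 if b else 1.0)
    z2 = (q2 if a else 1.0) * (r2 if b else 1.0)
    return z1, z2

def contribution(t, cause, g, zeta, haz):
    """Survival (and density, if cause > 0) term for one twin given genotype g."""
    value = 1.0
    for j, z in enumerate(frailties(g, *zeta[2:]), start=1):
        b, c = haz[j - 1]
        value *= math.exp(-z * b / c * (math.exp(c * t) - 1.0))
        if cause == j:
            value *= z * b * math.exp(c * t)
    return value

def geno_prob(g, zeta):
    return hwe(zeta[0])[g[0]] * hwe(zeta[1])[g[1]]

def dz_pair_likelihood(t1, l1, t2, l2, zeta, haz):
    """Average over parental genotypes; twins inherit independently given parents."""
    total = 0.0
    for gm, gf in product(GENOS, repeat=2):
        sa, sb = seg(gm[0], gf[0]), seg(gm[1], gf[1])
        twin = [
            sum(sa[g[0]] * sb[g[1]] * contribution(t, l, g, zeta, haz) for g in GENOS)
            for t, l in ((t1, l1), (t2, l2))
        ]
        total += geno_prob(gm, zeta) * geno_prob(gf, zeta) * twin[0] * twin[1]
    return total

def mz_pair_likelihood(t1, l1, t2, l2, zeta, haz):
    """Monozygotic twins share one genotype."""
    return sum(
        geno_prob(g, zeta)
        * contribution(t1, l1, g, zeta, haz)
        * contribution(t2, l2, g, zeta, haz)
        for g in GENOS
    )
```

Summing the logarithms of these contributions over all pairs gives a log-likelihood in (ω, ζ) that could be passed to a general-purpose numerical optimizer; here `zeta` is the tuple (pa, pb, r1, q1, r2, q2) and `haz` holds one (b, c) pair per cause.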

If the underlying cause-specific hazard functions are not parametrically defined, the nonparametric maximum likelihood estimators can be used to estimate the vector parameter β and the cause-specific cumulative hazard functions using the EM algorithm.14, 16

Linkage analysis

After the unknown parameters (β, ω, ζ) have been estimated, we can try to locate the positions of the longevity genes in the genome. Both monozygotic and dizygotic twin pairs contribute to the likelihood L(X, L|U, δ; β, ω, ζ). In contrast, only the dizygotic twin pairs contribute to the likelihood function for determining the position of the longevity genes. The search for linkage between two loci is based on calculating the Lod score (the decimal logarithm of the likelihood ratio), proposed by Barnard17 and used later by Morton18 in sequential test procedures of the null hypothesis θ0=0.5 versus an alternative value θ<0.5. Here θ is the probability of recombination between two loci. Lander and Green19 proposed an algorithm based on the hidden Markov chain concept to calculate the multipoint pedigree likelihood. This algorithm was later modified and optimized by Kruglyak and Lander20 and Kruglyak et al.21 The hidden Markov chain algorithm is used to calculate the probability of the extended vector of markers (MsE), s=1, ..., S+Ng (this vector consists of S observed markers M and Ng non-observed major genes), taking into account the Markov property of a pair (MsE, Vs), where (Vs) are inheritance vectors. In experiments with twins, an inheritance vector Vs=(Vs1, Vs2, Vs3, Vs4)' is a binary vector with four components at each locus s, s=1, ..., S+Ng. The first and the second components stand for the alleles inherited at this locus by a twin, and the third and the fourth ones characterize the alleles inherited by its co-twin. It is assumed that the first and the third components denote alleles inherited from the mother (0 if from the grandmother and 1 if from the grandfather). The second and the fourth components stand for alleles inherited from the father (same rules).
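To make the Lod score concrete, the following Python sketch evaluates it in the simplest classical setting: a two-point analysis with phase-known, fully informative meioses, where the likelihood depends only on the number of recombinants. This is a textbook special case shown for orientation, not the multipoint hidden Markov computation described above; the function name is hypothetical.

```python
import math

def two_point_lod(theta, recombinants, meioses):
    """log10 likelihood ratio of theta versus free recombination (theta0 = 0.5).

    Assumes phase-known, fully informative meioses, so that the likelihood is
    theta**recombinants * (1 - theta)**(meioses - recombinants).
    theta must lie strictly between 0 and 1 unless the matching count is zero.
    """
    nonrecombinants = meioses - recombinants
    log_likelihood = 0.0
    if recombinants:
        log_likelihood += recombinants * math.log10(theta)
    if nonrecombinants:
        log_likelihood += nonrecombinants * math.log10(1.0 - theta)
    return log_likelihood - meioses * math.log10(0.5)
```

With no recombinants among 10 meioses, the score at θ=0 equals 10·log10(2), which just exceeds the conventional threshold of 3.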

Given the location of the major genes at the chromosome, we calculate the Lod score for dizygotic twin pairs using the value

Here (G,M) is an extended genotype of a twin, including the genetic markers and the longevity genes. The vector parameter θ characterizes the location of the longevity genes in the genome (for example, the recombination distance from the respective neighboring markers). The details of the calculation of the extended segregation ratios P(G, M|Gm, Gf;θ) for twins can be found elsewhere.13 Finally, we calculate the Lod score in the form

where θ0 stands for longevity genes located at the recombination distance of 0.5 from all markers. We position the longevity loci at different places on a chromosome between the respective markers and calculate the Lod score value in each case. This Lod score profile can be used for testing linkage. In accordance with the current standard criterion for declaring linkage, we construct a support interval containing all the points where the Lod score is higher than or equal to 3. To exclude linkage, we consider the points where the Lod score is less than or equal to −2.
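The two thresholds can be applied to a computed profile as in the short Python sketch below. The profile is represented as parallel lists of candidate positions and Lod score values; this representation and the function name are illustrative assumptions.

```python
def classify_lod_profile(positions, lods, link_threshold=3.0, exclude_threshold=-2.0):
    """Split candidate positions into a support set (Lod >= 3 by default)
    and an exclusion set (Lod <= -2 by default)."""
    support = [p for p, z in zip(positions, lods) if z >= link_threshold]
    excluded = [p for p, z in zip(positions, lods) if z <= exclude_threshold]
    return support, excluded
```

Positions whose Lod score falls strictly between the two thresholds belong to neither set, which corresponds to an inconclusive result at those points.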

Simulation study

To investigate the properties of the model described above, we generated three types of data sets. In the first type, both longevity genes were situated on the same chromosome as the markers. In the second type, the first longevity gene was situated on the same chromosome as the markers and the second one outside this chromosome. In the third type, both longevity genes were situated outside the chromosome with markers. In all experiments, non-censored survival data for 1000 dizygotic twin pairs were controlled by two correlated competing mortality risks without observed covariates, and the earliest failure time of each twin was chosen. For the univariate cause-specific survival functions Sj(x), j=1, 2, we used the gamma-Gompertz parameterization

Here bj>0, cj>0 and sj2>0 are unknown parameters. That is, the cause-specific survival functions are equal to the cause-specific survival in a population of individuals who have survived to age 30 years (left truncation at age 30), with Gompertz underlying hazards and gamma-distributed frailty with mean 1 and variance sj2>0 at age 30 years. For survival data generation, we set bj=2.5 × 10−5, cj=0.1, sj2=0.01, j=1, 2. Given the parameters ζ, bj, cj, sj2, j=1, 2, the cause-specific underlying cumulative hazard functions H0j(x) can be found as follows. In the first step, we calculate Sj(x) for the given age x using the parameterization above. Then we calculate P(Gm|ζ), P(Gf|ζ) and P(G|Gm, Gf) using the assumptions of the model and the law of segregation. Finally, we find H0j(x) numerically using a simple bisection procedure. The functions H0j(x) are used to calculate the likelihood and the Lod score values. To generate the survival data, we set pa=pb=0.5 for the longevity allele frequencies, r2=q1=0.1 and either r1=q2=0.1 or r1=q2=0.05 for the mortality risks. It is not difficult to show that the vector parameter ζ=(pa, pb, r1, q1, r2, q2)=(0.5, 0.5, 0.1, 0.1, 0.1, 0.1) in the model with two dominant major genes and two causes of death with equal hazard functions corresponds to the vector of frailties (Z1, Z2, Z3)=(1.0, 0.55, 0.01) with probabilities (P1, P2, P3)=(0.25, 0.5, 0.25) in the model with one major gene and one cause of death. We can compare these values with the results obtained in the longevity study of Danish twins using the major gene model.8 That study considered the case of an autosomal locus with multiplicative action of one beneficial allele. The frailty Z3=0.01 in our study is substantially smaller than the corresponding frailty value found there, but this effect is caused by the assumption of multiplicative action of the beneficial alleles.
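The bisection step can be sketched in Python as follows. Two ingredients are assumptions made for this illustration, since the corresponding displayed formulas are not available here: the gamma-Gompertz survival is taken as (1 + s²·H(x))^(−1/s²) with H(x) the Gompertz cumulative hazard accumulated from age 30, and H0j(x) is taken to solve the equation that sets the genotype-averaged survival Σ P(G)·exp(−Zj(G)·H) equal to Sj(x). All names are hypothetical.

```python
import math

def gamma_gompertz_survival(x, b, c, s2, x0=30.0):
    """Assumed form: gamma frailty (mean 1, variance s2) at age x0, Gompertz
    hazard b*exp(c*x), survival conditional on being alive at x0."""
    cum = b / c * (math.exp(c * x) - math.exp(c * x0))
    return (1.0 + s2 * cum) ** (-1.0 / s2)

def mixture_survival(h, probs, frailties):
    """Population survival for a discrete frailty distribution at cumulative hazard h."""
    return sum(p * math.exp(-z * h) for p, z in zip(probs, frailties))

def solve_cumulative_hazard(target, probs, frailties, tol=1e-12):
    """Bisection for H >= 0 with mixture_survival(H) == target.

    Requires 0 < target <= 1 and strictly positive frailties."""
    lo, hi = 0.0, 1.0
    while mixture_survival(hi, probs, frailties) > target:
        hi *= 2.0  # expand the bracket until the root is enclosed
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if mixture_survival(mid, probs, frailties) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Bisection works here because the mixture survival decreases monotonically in H, so the equation has a single root once the bracket encloses it.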

For genetic data generation, we assumed that 10 genetic markers were uniformly distributed over a chromosome with a distance of 5 cM between neighboring markers. Each gene at a marker locus can be characterized by a pair from the set of 10 different alleles, and each allele occurs in the population with a frequency equal to 0.1. In the first type of simulated data, the first longevity gene was situated in the middle between the third and fourth markers and the second longevity gene was situated in the middle between the seventh and eighth markers. In the second type of simulated data, the first longevity gene was situated in the middle between the fifth and sixth markers, whereas the second longevity gene was outside this chromosome. In addition, we assumed that the observed markers and longevity genes are in linkage equilibrium and in Hardy–Weinberg equilibrium. The peaks of the Lod scores were counted to calculate the statistical power and the type I error rate. We used a Lod score threshold of 3 as an indicator of linkage.
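A marker data set of this kind could be generated along the lines of the following Python sketch. It simulates only the marker loci for one dizygotic pair, and it converts map distance to a recombination fraction with the Haldane map function, which is an assumption here because the text does not state which map function was used. All names and default values other than the 10 markers, 10 alleles and 5 cM spacing are hypothetical.

```python
import math
import random

def haldane_theta(distance_cm):
    """Recombination fraction from map distance (Haldane map function; an assumption)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def simulate_meiosis(n_loci, spacing_cm, rng):
    """Grandparental origin (0 or 1) of the transmitted allele at each locus."""
    theta = haldane_theta(spacing_cm)
    origin = [rng.randrange(2)]
    for _ in range(n_loci - 1):
        recombined = rng.random() < theta
        origin.append(origin[-1] ^ recombined)  # switch haplotype on recombination
    return origin

def random_genotype(n_loci, n_alleles, rng):
    """Two haplotypes with equally frequent alleles, drawn independently
    (Hardy-Weinberg and linkage equilibrium)."""
    return [[rng.randrange(n_alleles) for _ in range(n_loci)] for _ in range(2)]

def transmit(parent, spacing_cm, rng):
    """Haplotype passed from one parent to one child."""
    origin = simulate_meiosis(len(parent[0]), spacing_cm, rng)
    return [parent[o][s] for s, o in enumerate(origin)]

def simulate_dz_pair(n_loci=10, n_alleles=10, spacing_cm=5.0, seed=0):
    """Parents and two children; each child is a (maternal, paternal) haplotype pair."""
    rng = random.Random(seed)
    mother = random_genotype(n_loci, n_alleles, rng)
    father = random_genotype(n_loci, n_alleles, rng)
    twins = [
        (transmit(mother, spacing_cm, rng), transmit(father, spacing_cm, rng))
        for _ in range(2)
    ]
    return mother, father, twins
```

The list returned by `simulate_meiosis` plays the role of one component of the inheritance vector across loci, with one independent meiosis per parent and child.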

Results

In Table 2, the results for the parameter estimates bj, cj, sj, j=1, 2 and pa, pb, q1, q2, r1, r2 based on 100 simulated data sets are given. Table 2 includes the empirical means and s.d.’s of the estimates. Taking into account the s.d.’s, all parameter estimates are in agreement with the true values. The Lod score profiles averaged over all simulations are shown in Figures 1–6. A twofold decrease in the frailties r1, q2 (that is, a stronger beneficial action of the first longevity gene) can substantially increase the Lod score if this gene is situated on the same chromosome as the markers (see Figure 1). In contrast, it leads to a decrease of the Lod score if both longevity genes are situated outside the chromosome with markers (Figure 2). In this case, we observe background Lod score profiles without clear peaks. The coefficient of correlation between the life spans of siblings did not increase significantly with a twofold decrease in the frailties r1, q2 (0.23±0.03 versus 0.26±0.03). In summary, a stronger action of the longevity gene (smaller frailties r1, q2) can improve the chances of correctly accepting or rejecting the hypothesis that the longevity gene is situated on the chromosome with markers. In experiments with two longevity genes situated on the same chromosome as the markers, we observe two clear peaks situated symmetrically on the plots (see Figures 3 and 4). As expected, the heights of the peaks are similar if the actions of both longevity genes are also similar (see Figure 4). In other words, we cannot distinguish between the first and the second longevity genes using the Lod score profile. The stronger the action of the longevity gene, the higher the respective peak in the Lod score profile. From Figure 3, we can conclude that the first and the second longevity genes are most probably located in the first and in the second half of the chromosome with markers, respectively. The background Lod score profiles for both cases are shown in Figures 5 and 6. It seems that the values of the Lod score for these profiles depend only on the distance between the possible positions of the first and the second longevity genes and on the distance to the neighboring markers. In the neighborhood of the markers, the values of the Lod score are slightly smaller.

We assessed the power of the likelihood ratio test when comparing the true hypothesis H1 (the data set was generated using the model with two dominant major genes and vector parameter ζ1=(pa, pb, r1, q1, r2, q2), pa=pb=0.5, r1=q1=r2=q2=0.1) and the false null hypothesis H0 (the data set was generated using the model with one dominant major gene and vector parameter ζ0=(pa, r1, q2), pa=0.5, r1=q2=0.1). The power in this experiment was equal to 0.92.

Table 2 Summary of simulation results for unknown parameters

Figure 1

Lod score profile. The first longevity gene is situated between the fifth and sixth markers. The second longevity gene is outside the chromosome. pa=pb=0.5; r2=q1=0.1.

Figure 2

Lod score profile. Both longevity genes are out of the chromosome. pa=pb=0.5; r2=q1=0.1.

Figure 3

Contour map of the Lod score profile (smoothed). The first longevity gene was situated in the middle between the third and fourth markers and the second longevity gene was situated in the middle between the seventh and eighth markers. Power>0.99. pa=pb=0.5; r2=q1=0.1; r1=q2=0.05.

Figure 4

Contour map of the Lod score profile (smoothed). The first longevity gene was situated in the middle between the third and fourth markers and the second longevity gene was situated in the middle between the seventh and eighth markers. Power=0.89. pa=pb=0.5; r2=q1=0.1; r1=q2=0.1.

Figure 5

Contour map of the Lod score profile. Both longevity genes are out of the chromosome. Type I error<0.01. pa=pb=0.5; r2=q1=0.1; r1=q2=0.05.

Figure 6

Contour map of the Lod score profile. Both longevity genes are out of the chromosome. Type I error<0.01. pa=pb=0.5; r2=q1=0.1; r1=q2=0.1.

If the data set is generated under the assumption of a single major gene and the Lod score profile is calculated using the present method, we observe a peak situated near the diagonal θ1=θ2. Finally, we calculated the Lod score profile (averaged over 100 simulations) using the model with a single dominant major gene applied to the data set generated using the model with two dominant major genes (both situated on the same chromosome, in the middle between the third and fourth markers and in the middle between the seventh and eighth markers) (Figure 7). In this case, we do not observe clear peaks but a plateau. The area where the Lod score is greater than 3 extends over ∼32 cM.

Figure 7

Lod score profile. The first longevity gene was situated in the middle between the third and fourth markers and the second longevity gene was situated in the middle between the seventh and eighth markers (pa=pb=0.5; r2=q1=0.1; r1=q2=0.1). The Lod score profile was calculated under the assumption of a single dominant major gene model.

Discussion

In a previous paper,13 a two-step procedure was used to estimate the parameters of the univariate fit, the frailty distribution and the location of the longevity gene. In this paper, we extended this method to the case of two longevity genes and two correlated competing risks of mortality. The presence of longevity genes in the genome can be tested in the first step. In the second step, we locate the position of these genes in the genome. It is straightforward to extend the case of twin data with two related individuals to the case of family data with more than two related individuals if we use the model based on the major gene concept. The major gene model makes it possible to take into account not only the correlation between individuals but also the correlations between competing risks of mortality, using the parameters q1 and q2. Covariates such as age, gender and disease status, which influence the risk of mortality, can be easily taken into account in the form of a Cox-type regression. The unknown regression coefficients can be estimated in the first step together with the parameters bj, cj, sj, j=1, 2 and pa, pb, q1, q2, r1, r2. From our experiments with simulated data sets, we see that the parameters pa, pb, q1, q2, r1, r2 influence the values of the Lod scores and the heights of possible peaks. The smaller the values of q1, q2, r1, r2, the higher the peaks of the Lod scores and the greater the chance of detecting and localizing the longevity genes.

In principle, we can extend this model to one with mixed frailty by including an additional continuously distributed component of frailty. This continuous component will measure the averaged influence on mortality of a large number of genes and of the environment. However, this is not reasonable for the sample sizes used in this study, as the bivariate probability density functions for the models with discretely and continuously distributed frailties are very similar.8 In some cases, we can detect the presence of model misspecification by analyzing the Lod score profiles. For example, a large plateau in experiments with a single major gene can indicate the presence of two major genes situated on the same chromosome. Conversely, if a single major gene influences the lifespan and the present method is used, we will observe a peak of the Lod score situated near the diagonal.