Introduction

Using unrelated subjects in a case–control study is a popular design for testing association between genetic markers and phenotypes. However, spurious association may occur due to migration, nonrandom mating or population admixture. To avoid spurious evidence of association, Falk and Rubinstein1 proposed the Haplotype Relative Risk (HRR), which uses case–parent triads, as a method to test linkage disequilibrium (LD) between a marker and a putative disease locus. The HRR compares parental marker alleles transmitted to an affected offspring with those not transmitted as a test for association. When population admixture is present, the HRR is conservative, resulting in reduced power for testing association. Spielman et al2 suggested the transmission/disequilibrium test (TDT) to adjust for population admixture using a matched study design. The TDT examines whether heterozygous parents preferentially transmit certain alleles to an affected offspring.
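As an illustration of the transmission comparison described above, the TDT is commonly written as a McNemar-type statistic on transmissions from heterozygous parents. The sketch below assumes hypothetical counts b and c (transmissions of B1 and of B2 from B1B2 parents); the function name and the example counts are illustrative and are not taken from this study.

```python
import math

def tdt(b, c):
    """McNemar-type TDT from heterozygous-parent transmissions.

    b: times a B1B2 parent transmitted B1 to the affected offspring
    c: times a B1B2 parent transmitted B2
    (both counts are hypothetical inputs for this sketch)
    Returns (chi-square statistic, asymptotic p-value on 1 d.f.).
    """
    if b + c == 0:
        raise ValueError("no heterozygous parents: the TDT is undefined")
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 d.f.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

stat, p_value = tdt(60, 40)
print(f"TDT = {stat:.2f}, p = {p_value:.4f}")
```

Only heterozygous parents enter the statistic, which is why triads with two homozygous parents carry no information for this test.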

Although family-based triads are robust to population admixture, the collection of parental genotypes is often difficult because of death or refusal to participate. Family-based association tests, such as the HRR and TDT, are generally not applicable when parental genotypes are not complete. Curtis and Sham3 showed that the estimate of the probability of transmission of certain alleles is biased in the TDT when one parent is missing, and only heterozygous parents and homozygous offspring contribute to the test. Assuming that parental genotypes are missing completely at random (MCAR), such bias is avoided by the 1-TDT test (TDT with only one parent) proposed by Sun et al,4 a test that uses genotypes of the affected offspring and the one available parent, but excludes affected offspring where both the offspring and the parent are heterozygous. In addition, several other strategies with the same assumption have been proposed to accommodate incomplete triads.5, 6, 7

Allowing for informative missingness of parental genotypes, Allen et al8 and Chen9 proposed valid tests incorporating incomplete triads. However, their strategies are less powerful when the missing data pattern is in fact MAR or MCAR. For example, in Chen's Table 4,9 the power of the 1-d.f. score statistic is less than that of the TDT using intact triads only for a common (rare) allele under the dominant (recessive) disease model. The same holds for the 2-d.f. score statistic for both rare and common variant alleles under multiplicative inheritance. This means that the inclusion of dyads (incomplete triads with only one parental genotype) reduces the power of the score test in these cases.

Regardless of different missing data patterns among parental genotypes, the above methods assumed that offspring genotypes were MCAR. Recently, Guo10 derived the conditional distribution of ascertained triads that allows informative missingness for offspring genotypes, as well as their parental genotypes, and evaluated several tests under such scenarios. Guo10 indicated that when offspring genotypes were missing informatively, a circumstance that can be considered as ascertainment bias, inflated type-I error and/or reduced power may occur using the TDT excluding incomplete triads. Therefore, if the missing data pattern for offspring genotypes is not confirmed to be MCAR, a significant result from the TDT using only intact triads does not assure true association between the marker and a putative disease locus.

In an effort to assure a valid conclusion, we introduce a new test called Testing Informative Missingness (TIM) to determine whether the missing data pattern in ascertained triads is informative or not.

Statistical method

We derived the conditional distribution of ascertained triads and dyads (Table 1) in Appendix 1.11 Note that Pki,j and Mki,j represent the theoretical probability and the observed count, respectively, for each type of triad data. k=‘0’, ‘1’ or ‘2’ represents the total number of B1 alleles transmitted to the offspring, and i, j=‘0’, ‘1’ or ‘2’ represent the ordered total numbers of B1 alleles for fathers and mothers, respectively. Note that we use the superscript ‘*’ to denote that the parental genotype is missing.

Table 1 Conditional distribution of ascertained triads and dyads

Based on Table 1, we calculated the conditional distribution of parental genotypes among triads and dyads, displayed in Table 2. Under the null hypothesis of MCAR, conditional on offspring genotypes, the distributions of parental genotypes among triads and dyads are identical. Therefore, a logistic regression approach can be implemented to test for informative missingness of parental genotypes.

Table 2 Conditional distribution of parental genotypes among triads and dyads

Let the outcome variable Y be 1 if the parental genotype is from a complete triad and 0 if the parental genotype is from a dyad. Parents of a triad contribute two independent observations and the available parent among dyads contributes one observation. By choosing genotype B1B1 to be the reference group, let the first (second) dummy variable of parental genotype be D1=1 (D2=1) if the parental genotype is B1B2 (B2B2); otherwise D1=0 (D2=0). Similarly, let the first (second) dummy variable of offspring genotype be G1=1 (G2=1) if the offspring genotype is B1B2 (B2B2); otherwise G1=0 (G2=0). As a result, conditional on the affected offspring genotypes, the distribution of parental genotypes among triads can be compared to that of dyads by the logistic model

$$\operatorname{logit}(p) = \beta_0 + \beta_{D_1} D_1 + \beta_{D_2} D_2 + \beta_{G_1} G_1 + \beta_{G_2} G_2,$$

where $p = P(Y=1 \mid D_1, D_2, G_1, G_2)$.

Under the null hypothesis that the genotypes of ascertained triads and dyads are MCAR, the null hypothesis of TIM is $\beta_{D_1} = \beta_{D_2} = 0 \mid G_1, G_2$, which states that the distributions of parental genotypes are identical among triads and dyads controlling for offspring genotypes.
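One way to carry out such a test is a likelihood-ratio comparison of the logistic model with and without the two parental-genotype dummies, referred to a chi-square distribution with 2 d.f. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the choice of a likelihood-ratio (rather than Wald or score) statistic, the genotype coding (0 = B1B1, 1 = B1B2, 2 = B2B2) and the counts are all hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def max_loglik(rows):
    """Maximised logistic log-likelihood; rows are (covariates, y, weight)."""
    p = len(rows[0][0])
    beta = [0.0] * p
    for _ in range(100):  # Newton-Raphson iterations
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, y, w in rows:
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for a in range(p):
                grad[a] += w * (y - mu) * x[a]
                for c in range(p):
                    hess[a][c] += w * mu * (1.0 - mu) * x[a] * x[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    loglik = 0.0
    for x, y, w in rows:
        eta = sum(b * v for b, v in zip(beta, x))
        loglik += w * (y * eta - math.log1p(math.exp(eta)))
    return loglik

def tim_test(counts):
    """counts[(g, d, y)]: number of parental genotypes d observed with
    offspring genotype g, from triads (y=1) or dyads (y=0)."""
    full, reduced = [], []
    for (g, d, y), n in counts.items():
        if n == 0:
            continue
        g1, g2 = int(g == 1), int(g == 2)
        d1, d2 = int(d == 1), int(d == 2)
        full.append(([1, d1, d2, g1, g2], y, n))
        reduced.append(([1, g1, g2], y, n))
    stat = max(2.0 * (max_loglik(full) - max_loglik(reduced)), 0.0)
    p_value = math.exp(-stat / 2.0)  # chi-square survival function, 2 d.f.
    return stat, p_value

# Hypothetical counts: triads outnumber dyads 2:1 for every parental genotype
# within each offspring stratum, so there is no sign of informative missingness.
counts = {}
for g, d, n_dyad in [(0, 0, 10), (0, 1, 15), (1, 0, 8), (1, 1, 20),
                     (1, 2, 12), (2, 1, 15), (2, 2, 25)]:
    counts[(g, d, 1)] = 2 * n_dyad
    counts[(g, d, 0)] = n_dyad
print(tim_test(counts))
```

Fitting on grouped counts with weights keeps the sketch dependency-free; a statistics package with a logistic regression routine would serve equally well.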

Scenarios under missing at random

Little and Rubin12 indicated that MCAR means that the cause of missingness is unrelated to the data items, so that the observed values form a random subsample of the sampled values. Missing at random (MAR) means that the probability of a missing value for an outcome depends on the observed responses of other covariates, but given these, it does not depend on the missing value itself. Within subgroups formed by the observed covariates on which the missingness depends, the data are MCAR. Therefore, scenarios under MAR are also considered as the null distribution, since the missingness becomes MCAR after adjusting for available covariates related to the missing data mechanism.

For example, suppose you are modeling weight (Y) as a function of sex (X). Some respondents would not disclose their weight, so you are missing some values for Y. One sex may be less likely to disclose its weight. That is, the probability that Y is missing depends only on the value of X. Such data are MAR and weight conditional on sex (Y|X) is MCAR. Therefore, the data can be considered as MCAR within subgroups formed by the observed items (covariates) on which the missingness depends. Here, let the covariates be X1, X2,…,XK and the logistic model is

$$\operatorname{logit}(p) = \beta_0 + \beta_{D_1} D_1 + \beta_{D_2} D_2 + \beta_{G_1} G_1 + \beta_{G_2} G_2 + \sum_{k=1}^{K} \beta_{X_k} X_k,$$

where $p = P(Y=1 \mid D_1, D_2, G_1, G_2, X_1, \ldots, X_K)$.

The null hypothesis of MCAR is $\beta_{D_1} = \beta_{D_2} = 0 \mid G_1, G_2, X_1, X_2, \ldots, X_K$. Therefore, even when the missing data mechanism is MAR but not MCAR, the TIM does not reject the null hypothesis when covariates related to missingness X1, X2,…,XK are taken into account.

Simulations

We first assumed that the population is free from population stratification. Let ‘a’ and ‘A’ denote the disease and normal alleles, respectively. Let D denote that an individual is diseased. Let f denote the probability of being affected when an individual carries 0 risk alleles (the phenocopy rate), and let K denote the genotype relative risk (GRR). For a recessive disease model, the penetrance functions are P(D|AA)=P(D|Aa)=f and P(D|aa)=K × f, where 0≤f≤1 and 0≤K × f≤1. The disease prevalence is determined by these probabilities and the risk allele frequency. Similarly, for a dominant disease model, P(D|AA)=f and P(D|Aa)=P(D|aa)=K × f. For an additive disease model, P(D|AA)=f, P(D|Aa)=K × f and P(D|aa)=(2 × K−1) × f, with K>0.5. We considered additive, recessive, and dominant disease models in our simulations and the affection status of each individual is determined according to these parameters.
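The penetrance settings can be written down compactly. The sketch below is illustrative only: it reads the additive homozygote risk as (2K − 1) × f, that is, each risk allele adds (K − 1) × f, and it computes prevalence under Hardy-Weinberg proportions, which is an assumption of this example rather than something stated in the text. Function names are hypothetical.

```python
def penetrances(model, f, K):
    """Return (P(D|AA), P(D|Aa), P(D|aa)) for phenocopy rate f and GRR K."""
    if model == "recessive":
        pen = (f, f, K * f)
    elif model == "dominant":
        pen = (f, K * f, K * f)
    elif model == "additive":
        # Assumed reading: each risk allele adds (K - 1) * f to the risk.
        pen = (f, K * f, (2 * K - 1) * f)
    else:
        raise ValueError(f"unknown model: {model}")
    if not all(0.0 <= p <= 1.0 for p in pen):
        raise ValueError("f and K give a penetrance outside [0, 1]")
    return pen

def prevalence(pen, q):
    """Disease prevalence for risk-allele frequency q.

    Genotype frequencies are taken at Hardy-Weinberg proportions
    (an assumption of this sketch).
    """
    p_AA, p_Aa, p_aa = (1 - q) ** 2, 2 * q * (1 - q), q ** 2
    return p_AA * pen[0] + p_Aa * pen[1] + p_aa * pen[2]

for model in ("recessive", "dominant", "additive"):
    pen = penetrances(model, f=0.2, K=2.0)
    print(model, pen, round(prevalence(pen, q=0.3), 4))
```

With f=0.2 and K=2, the risk-genotype penetrance K × f is 0.4, matching the penetrance and phenocopy rate used in the tables.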

We simulated a general population where nuclear families have exactly one offspring. We randomly assigned each offspring, father, and mother to be missing according to various probabilities indicated in each table. Therefore, only a proportion of families with an affected offspring were eligible for the study and a total of 500 families were sampled.
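The missingness step can be sketched as follows. This is a hedged illustration, not the simulation code used for the tables: genotypes are coded as the number of B1 alleles, each family member has a genotype-specific missing probability (equal probabilities across genotypes give MCAR, unequal ones give informative missingness), and treating a family with a missing offspring genotype as excluded is an assumption of this sketch. All names and inputs are hypothetical.

```python
import random

def classify(father_obs, mother_obs, offspring_obs):
    """Label a family by which genotypes remain observed."""
    if not offspring_obs:
        return "excluded"  # assumption: no offspring genotype, family unusable
    n_parents = int(father_obs) + int(mother_obs)
    return {2: "triad", 1: "dyad", 0: "monad"}[n_parents]

def apply_missingness(families, p_father, p_mother, p_offspring, seed=0):
    """families: list of (father, mother, offspring) genotypes coded 0, 1 or 2.

    p_father, p_mother, p_offspring: dicts mapping genotype to the
    probability that this member's genotype is missing.
    Returns a tally of the resulting family types.
    """
    rng = random.Random(seed)
    tally = {"triad": 0, "dyad": 0, "monad": 0, "excluded": 0}
    for father, mother, offspring in families:
        father_obs = rng.random() >= p_father[father]
        mother_obs = rng.random() >= p_mother[mother]
        offspring_obs = rng.random() >= p_offspring[offspring]
        tally[classify(father_obs, mother_obs, offspring_obs)] += 1
    return tally

# MCAR example: 30% of fathers, 20% of mothers and 20% of offspring missing,
# regardless of genotype (rates chosen for illustration only).
mcar = lambda rate: {0: rate, 1: rate, 2: rate}
families = [(1, 1, 1), (2, 1, 2), (0, 1, 1), (1, 2, 2)] * 125
print(apply_missingness(families, mcar(0.3), mcar(0.2), mcar(0.2)))
```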

Several disease allele (denoted a) and marker allele (denoted B1) frequencies were examined. A range of possible values for the disequilibrium coefficient δ=p(aB1)−p(a)p(B1) and recombination fraction θ were simulated. The frequencies of the disease and marker alleles, the disease model, the phenocopy rate, and the penetrance are indicated in each table.
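The definition of δ fixes all four haplotype frequencies once the two allele frequencies are given. The short sketch below makes that arithmetic explicit; the function name is hypothetical and the example values (0.3, 0.4, δ=0.05) are taken from the settings quoted in this paper.

```python
def haplotype_frequencies(p_a, p_b1, delta):
    """Haplotype frequencies implied by allele frequencies and
    delta = p(aB1) - p(a) * p(B1)."""
    freqs = {
        "aB1": p_a * p_b1 + delta,
        "aB2": p_a * (1 - p_b1) - delta,
        "AB1": (1 - p_a) * p_b1 - delta,
        "AB2": (1 - p_a) * (1 - p_b1) + delta,
    }
    if min(freqs.values()) < 0:
        raise ValueError("delta is outside the range allowed by the allele frequencies")
    return freqs

print(haplotype_frequencies(p_a=0.3, p_b1=0.4, delta=0.05))
```

The range check matters because δ is bounded by the allele frequencies: a value that drives any haplotype frequency below zero cannot be simulated.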

We repeated the simulation 10 000 (1000) times to examine type-I error (power) of several tests examined including the TIM. Under the null hypothesis of MCAR, the fraction of times that the test statistic exceeds the critical value, defined by the asymptotic distribution of the statistic, is the type-I error. The power of each test is the proportion of test statistics in the total number of simulations exceeding the critical value under the alternative hypothesis. Type-I error and power of several LD tests were also evaluated under the various patterns of parental genotype missingness.
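The empirical type-I error and power described above are both rejection proportions over replicates. The sketch below (hypothetical function names) computes that proportion from a list of p-values, which is equivalent to comparing each statistic with its asymptotic critical value, and adds the binomial Monte Carlo standard error as a rough guide to how far an estimate may stray from the nominal level.

```python
import math

def rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates whose p-value falls below alpha."""
    if not p_values:
        raise ValueError("at least one replicate is required")
    return sum(p < alpha for p in p_values) / len(p_values)

def monte_carlo_se(rate, n_replicates):
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_replicates)

# Standard error of a nominal 5% rate for the two replicate counts used here.
for n in (10_000, 1_000):
    print(n, round(100 * monte_carlo_se(0.05, n), 2), "percentage points")
```

With 10 000 replicates a nominal 5% rate has a standard error of about 0.22 percentage points, so small departures from 5% in the type-I error tables are within simulation noise.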

In a second set of simulations we introduced population stratification by sampling two populations with expected sample sizes reflecting different disease frequencies in the subpopulations. For example, for a pure recessive model, if the disease allele frequencies of the two populations are 0.3 and 0.6, respectively, then 9% of the first and 36% of the second population would be affected and sampled. Therefore, we expect 20% and 80% of the sample to come from the first and second populations, respectively. This is the ratio we would observe in most admixed samples. Because the disease allele frequencies are different in the two populations, the frequencies of diseased individuals in the two samples are also different.
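The 20%/80% split follows from normalizing the affected fractions. The sketch below reproduces that arithmetic under the assumptions implicit in the example: a fully penetrant recessive disease with no phenocopies, Hardy-Weinberg proportions, and source populations of equal size. The function name is hypothetical.

```python
def expected_sample_shares(risk_allele_freqs):
    """Expected share of affected individuals drawn from each population.

    Assumes a pure recessive model (affected fraction q**2) and
    equally sized source populations.
    """
    affected = [q ** 2 for q in risk_allele_freqs]
    total = sum(affected)
    return [a / total for a in affected]

shares = expected_sample_shares([0.3, 0.6])
print([round(s, 3) for s in shares])
```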

In Tables 3, 4, 5 and 6, the column marked ‘TDT’ reports results using the traditional TDT2 test on the subset of complete triads only. The columns marked ‘1-TDT’ and ‘EM-HRR’ (expectation maximization algorithm based haplotype relative risk)7 use both the complete triads and dyads. The column marked ‘TIM’ is the test of informative missingness.

Table 3 Type-I error (%) of TIM at α=0.05 based on 10 000 replicates
Table 4 Type-I error (%) of TIM at α=0.05 under population admixture based on 10 000 replicates
Table 5 Power (%) of TIM at α=0.05 with no linkage (θ=0.5) and no association (δ=0) based on 1000 replicates
Table 6 Power (%) of TIM at α=0.05 with linkage (θ=0) and association (δ=0.05) based on 1000 replicates

Results

When genotypes of ascertained offspring and parents are MCAR, the type-I errors of TIM in a homogeneous population are displayed in Table 3. Both the disease and marker allele frequencies are 0.3. The disease penetrance and phenocopy rate are 0.4 and 0.2, respectively. Different disease and marker allele frequencies, penetrance, and phenocopy rates yielded similar results, not shown here. The underlying disease model (dominant, additive, or recessive) is indicated in the first column. The second and third columns are the recombination fraction θ and disequilibrium coefficient δ. The missing rates for fathers, mothers, and offspring are displayed as the first, second, and third numbers in parentheses in the fourth column. The three missing rates may differ. However, each of the three missing rates is identical for all genotypes B1B1, B1B2, and B2B2, such that the missing patterns are considered as MCAR. When there is no linkage (θ=0.5) and no association (δ=0), the TDT, 1-TDT, and EM-HRR have the expected 5% chance of rejecting the null hypothesis. When there is linkage (θ=0) and association (δ=0.05), the powers of TDT, 1-TDT, and EM-HRR are displayed in the bottom rows of Table 3. One can see that TIM has the expected 5% type-I error regardless of the relationship between the marker and the disease alleles (independent of the values of θ and δ).

When genotypes of ascertained offspring and parents are MCAR in an admixed population, the type-I error of TIM is displayed in Table 4. The disease allele (minor marker allele) frequencies for the first and second populations are 0.2 and 0.6 (0.4 and 0.3), respectively. The disease penetrance and phenocopy rates are 0.4 and 0.2, respectively. Since TDT, 1-TDT, and EM-HRR are robust to population stratification, all tests have the expected 5% error rates when there is no linkage (θ=0.5) and no association (δ=0). Under the alternative hypothesis with linkage (θ=0) and association (δ=0.05), TDT has the lowest power due to the exclusion of dyads in the analysis. Therefore, the 1-TDT is more powerful than TDT and EM-HRR has the highest power for detecting linkage and association, matching previous reports.4, 7 Since TIM has the expected 5% type-I error in the extreme scenarios we simulated, TIM is also robust to population admixture.

Simulation results displayed in Tables 5 and 6 are circumstances under which genotypes of trios are missing informatively in a homogeneous population. The disease and marker allele frequencies are 0.3 and 0.4, respectively. The disease penetrance and phenocopy rate are 0.4 and 0.2, respectively. Within each disease model, we first display four scenarios where informative missingness of genotypes occurred in parents only and genotypes of ascertained offspring are MCAR (20% missing rates for all genotypes). Secondly, we introduced informative missingness for genotypes of ascertained offspring as well as their parents and the results are displayed in rows 5–8 within each disease model.

In Table 5, when there is no association (δ=0) and no linkage (θ=0.5) and offspring genotypes are MCAR (rows 1–4 in each disease model), the TDT using the subset of complete triads remains a valid test for linkage and association with expected 5% type-I error. The 1-TDT and EM-HRR, using both triads and dyads, had inflated type-I errors and the inflation increased with respect to magnitude and pattern of informative missingness. TIM has better power to detect informative missingness in the more frequent parental genotypes (B2B2) than the less frequent genotypes (B1B1). When offspring genotypes are missing informatively (rows 5–8 in each disease model), by excluding dyads from the analysis, TDT is no longer valid for testing linkage and association. However, incorporation of dyads and monads reduced such biases.10 The simulation results suggest that the TIM maintains good power when offspring genotypes are also missing informatively.

In Table 6, when there was association (δ=0.05) and linkage (θ=0), the power of the 1-TDT and EM-HRR can be lower or higher than TDT, suggesting that incorporating dyads either dampened or inflated the power of those tests when the MCAR assumption was violated, matching the investigations by Guo et al.10, 13 In the scenarios we examined, the TIM has decent power to detect informative missingness and its performance is closely related to the missing data pattern. Power of TIM was not confounded by linkage and association between the disease locus and the marker.

Application to the Framingham Heart Study

The Framingham Heart Study began in 1948 with the enrollment of 5209 men and women.14, 15 In 1971, 5124 men and women were enrolled into the Framingham Offspring Study, which included the offspring (and their spouses) of the original cohort. Offspring participants underwent examinations approximately every 4 years; the design and methodology have been previously described.16, 17 The sample analyzed was comprised of Framingham Offspring Study participants who attended the sixth examination cycle between 1995 and 1998 and the apolipoprotein E (apoE) genotypes of the first generation cohort. The Framingham Heart Study protocol is approved by the Boston Medical Center Institutional Review Board and all participants provided written informed consent.

Ordovas et al18 reported evidence for association of the apoE isoform with elevated total cholesterol (TC) levels in the Framingham Heart Study. Jarvik et al19 addressed the possible influence of apoE genotype on age-related changes in TC from a male twin longitudinal study. Several studies of unrelated subjects also reported association between the apoE gene and TC.20, 21, 22, 23, 24

Because genotypes of the first generation cohort were collected nearly 40 years after the initialization of the study, it has been questioned whether the missing data pattern of parental genotypes of the Framingham Heart Study was affected by potential survival bias. Therefore, we applied TIM to the relation of elevated total cholesterol and APOE genotype. The apoE gene has three common alleles, apoE2, apoE3, and apoE4, and the genotype frequencies are 0, 0.089, 0.021, 0.642, 0.223, and 0.025 for E2/E2, E2/E3, E2/E4, E3/E3, E3/E4, and E4/E4, respectively. We adopted an approach similar to that implemented by Guo et al25, combining the rare allele apoE2 (associated with low TC) with the major allele apoE3 to compare those with at least one apoE4 allele (associated with high TC) to those without any. Of the 3532 participants attending the sixth offspring examination, there were 1041 individuals with at least one parental APOE genotype. Among these individuals, there were 472 with elevated total cholesterol (greater than 200 mg per 100 ml), deriving from 427 independent nuclear families. Therefore, there were 229 dyads and 198 triads included in the analysis.
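The allele collapsing can be illustrated with the genotype frequencies quoted above. This is a small sketch with hypothetical names: it groups the six apoE genotypes into carriers of at least one apoE4 allele and non-carriers, and sums the reported frequencies within each group.

```python
# Genotype frequencies as reported in the text.
genotype_freq = {
    ("E2", "E2"): 0.0,
    ("E2", "E3"): 0.089,
    ("E2", "E4"): 0.021,
    ("E3", "E3"): 0.642,
    ("E3", "E4"): 0.223,
    ("E4", "E4"): 0.025,
}

def carrier_frequency(freqs, allele="E4"):
    """Total frequency of genotypes carrying at least one copy of `allele`."""
    return sum(f for genotype, f in freqs.items() if allele in genotype)

carriers = carrier_frequency(genotype_freq)
non_carriers = sum(genotype_freq.values()) - carriers
print(f"E4 carriers: {carriers:.3f}, non-carriers: {non_carriers:.3f}")
```

Note that the TIM model as described uses three genotype classes; how the collapsed alleles map onto those classes in the Framingham analysis is not spelled out here, so the sketch stops at the carrier grouping.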

In Table 7, we display the distribution of parental genotypes among dyads and triads by offspring genotypes. The logistic regression of missing status yielded a P-value of 0.8624 for parental genotype adjusting for offspring genotype. Therefore, there was no statistically significant evidence of informative missingness of parental APOE genotypes in the families with offspring who have elevated total cholesterol in the Framingham Heart Study.

Table 7 Distribution of APOE genotypes

Discussion

For the case–parent triads design, the TDT2 cannot include families with incomplete parental genotypes. Approaches such as the 1-TDT4 and EM-HRR7 were designed to include such families due to missingness related to a disease in ascertainment, not to genotyping failure. They may be more powerful than the TDT2 but are valid only if the missingness is not informative, that is, missingness independent of the underlying genotype (MAR). Although approaches proposed by Allen et al8 and Chen9 can include incomplete triads and are valid under informative missingness, they may not be as powerful as 1-TDT4 and EM-HRR7 when the missing data pattern is truly MCAR. Regardless of different missing data patterns among parental genotypes, the above methods assumed that offspring genotypes were MCAR. Recently, Guo10 indicated that when offspring genotypes were missing informatively, a circumstance that can arise from ascertainment bias, inflated type-I error and/or reduced power may occur using the TDT excluding incomplete triads.

The purpose of this work is to provide a test for informative missingness in the context of case–parent triad designs for genetic linkage and/or association studies, in an effort to avoid a biased conclusion. Our approach compares the parental genotype distribution in triads to that of dyads conditional on the genotypes of affected offspring. Differential parental genotype distributions in triads and dyads indicate that parental genotypes are missing informatively. We have shown, through theoretical derivations and computer simulations, that TIM is not affected by linkage (θ) or association (δ). It provides the expected 5% type-I error at the α=0.05 level under MCAR and is robust to population admixture. Simulation results suggest that TIM has adequate power to test informative missingness in moderately sized samples. In the logistic regression framework, TIM remains a valid test under MAR by conditioning on available covariates X1, X2,…,XK related to missingness.

Given a significant TIM result, assuming that informative missingness exists only in parental genotypes due to, for example, a late onset fatal disease such as cardiovascular disease, the strategies of Allen et al8 and Chen9 are recommended to incorporate dyads. Otherwise, the 1-TDT4 and EM-HRR7 are appropriate and may provide higher power. However, one should be aware of the basic assumption of absence of ascertainment bias in all TDT/family-based association tests. If TIM is significant for an early onset fatal disease, one should be aware that none of the existing methods yields a valid result, including the TDT with only complete trios, as illustrated and discussed by Guo.10

The proposed test TIM is developed for case–parent triad designs. However, it is readily applicable to other designs consisting of parents and affected offspring by selecting triads and dyads from the data, although this may not be the most powerful approach because sibling information is discarded. Therefore, our future work will extend TIM to consider more general pedigrees. Many recently published genome-wide association studies (GWAs) are case–control designs, but there are also family-based GWAs with available genotyping on parent–offspring trios, such as the Framingham Heart Study and International Multi-Center ADHD Genetics Project (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap). It is likely that more family-based GWA studies using DNA already collected in the past will emerge in the near future. The advantage of the parent–offspring design over the case–control design is that the former is immune to population admixture, and it tests for linkage as well as association. Therefore, our method may be useful for these designs to further reduce false positives due to informative missingness in the modern era of genome-wide studies.