Testing informative missingness in genetic studies using case–parent triads

Guo, Chao-Yu; Cupples, Laura Adrienne; Yang, Qiong

doi:10.1038/ejhg.2008.38

Article
Published: 12 March 2008

Chao-Yu Guo^{1, 2, 3}, Laura Adrienne Cupples⁴ & Qiong Yang⁴

European Journal of Human Genetics volume 16, pages 992–1001 (2008)

Abstract

In genetic studies, the transmission/disequilibrium test (TDT) using case–parent triads has gained popularity owing to its robustness to population admixture. Several extensions have been proposed to accommodate incomplete triads. Some strategies assume that parental genotypes are missing completely at random (MCAR) to ensure an unbiased conclusion, whereas other methods allow parental genotypes to be missing informatively, at the cost of reduced power when the missing data pattern is in fact MCAR. However, these tests assumed that offspring genotypes were MCAR. Recently, Guo indicated that when offspring genotypes were missing informatively, an occurrence that can be considered as ascertainment bias, inflated type-I error and/or reduced power may occur using the TDT when incomplete triads are excluded. In an effort to avoid an erroneous conclusion, we propose a strategy called testing informative missingness (TIM) that compares conditional distributions of parental genotypes among complete triads and incomplete data with only one parent to examine the missing data pattern. Computer simulations show that TIM has decent power to detect informative missingness and is robust to population admixture. In addition, we illustrate TIM with an application to the Framingham Heart Study.

Introduction

Using unrelated subjects in a case–control study is a popular design for testing association between genetic markers and phenotypes. Spurious association may occur due to migration, nonrandom mating or population admixture. In order to avoid spurious evidence of association, Falk and Rubinstein¹ proposed the Haplotype Relative Risk (HRR), which uses case–parent triads, as a method to test linkage disequilibrium (LD) between a marker and a putative disease locus. The HRR compares parental marker alleles transmitted to an affected offspring to those not transmitted as a test for association. When population admixture is present, HRR is conservative resulting in reduced power for testing associations. Spielman et al² suggested the transmission/disequilibrium test (TDT) to adjust for population admixture using a matched study design. TDT examines if heterozygous parents preferentially transmit certain alleles to an affected offspring.
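As a minimal sketch of the TDT described above, the test can be written as a McNemar-type statistic on heterozygous parents, following the standard form given by Spielman et al. The function name and the example counts b and c are hypothetical and are not taken from this article.

```python
import math

def tdt(b, c):
    """Transmission/disequilibrium test on heterozygous parents.

    b: number of heterozygous parents transmitting allele B1 to the affected offspring
    c: number of heterozygous parents transmitting allele B2
    Returns the chi-square statistic (1 d.f.) and its p-value.
    """
    if b + c == 0:
        raise ValueError("no heterozygous parents: the TDT is undefined")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper-tail probability of a chi-square variable with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts, for illustration only.
chi2, p_value = tdt(b=60, c=40)
print(f"TDT chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

Only heterozygous parents enter the statistic, which is why the comparison is matched within families and unaffected by differences in allele frequency between subpopulations.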

Although family-based triads are robust to population admixture, the collection of parental genotypes is often difficult because of death or refusal to participate. Family-based association tests, such as the HRR and TDT, are generally not applicable when parental genotypes are not complete. Curtis and Sham³ showed that the estimate of the probability of transmission of certain alleles is biased in the TDT when one parent is missing, and only heterozygous parents and homozygous offspring contribute to the test. Assuming that parental genotypes are missing completely at random (MCAR), such bias is avoided by the 1-TDT test (TDT with only one parent) proposed by Sun et al,⁴ a test that uses genotypes of the affected offspring and the one available parent, but excludes affected offspring where both the offspring and the parent are heterozygous. In addition, several other strategies with the same assumption have been proposed to accommodate incomplete triads.^{5, 6, 7}

Allowing for informative missingness of parental genotypes, Allen et al⁸ and Chen⁹ proposed valid tests incorporating incomplete triads. However, their strategies are less powerful when the missing data pattern is in fact MAR or MCAR. For example, in Chen's Table 4,⁹ the power of the 1-d.f. score statistic is lower than that of the TDT using intact triads only for a common (rare) allele under the dominant (recessive) disease model. The same holds for the 2-d.f. score statistic for both rare and common variant alleles under multiplicative inheritance. That is, including dyads (incomplete triads with only one parental genotype) reduces the power of the score test in these cases.

Regardless of different missing data patterns among parental genotypes, the above methods assumed that offspring genotypes were MCAR. Recently, Guo¹⁰ derived the conditional distribution of ascertained triads that allows informative missingness for offspring genotypes, as well as their parental genotypes, and evaluated several tests under such scenarios. Guo¹⁰ indicated that when offspring genotypes were missing informatively, a circumstance that can be considered as ascertainment bias, inflated type-I error and/or reduced power may occur using the TDT excluding incomplete triads. Therefore, if the missing data pattern for offspring genotypes is not confirmed to be MCAR, a significant result from the TDT using only intact triads does not assure true association between the marker and a putative disease locus.

In an effort to assure a valid conclusion, we introduce a new test called Testing Informative Missingness (TIM) to determine whether the missing data pattern in ascertained triads is informative or not.

Statistical method

We derived the conditional distribution of ascertained triads and dyads (Table 1) in Appendix 1.¹¹ Here P_k^{i,j} and M_k^{i,j} denote the theoretical probability and the observed count for each type of triad data; k=‘0’, ‘1’ or ‘2’ is the total number of B₁ alleles transmitted to the offspring, and i, j=‘0’, ‘1’ or ‘2’ are the ordered total numbers of B₁ alleles for fathers and mothers, respectively. The superscript ‘*’ denotes that the parental genotype is missing.

Table 1 Conditional distribution of ascertained triads and dyads

Based on Table 1, we calculated the conditional distribution of parental genotypes among triads and dyads, displayed in Table 2. Under the null hypothesis of MCAR, conditional on offspring genotypes, the distributions of parental genotypes among triads and dyads are identical. Therefore, a logistic regression approach can be implemented to test for informative missingness of parental genotypes.

Table 2 Conditional distribution of parental genotypes among triads and dyads

Let the outcome variable be Y=1 if the parental genotype is from a complete triad and Y=0 if the parental genotype is from a dyad. Parents of a triad contribute two independent observations and the available parent among dyads contributes one observation. By choosing genotype B₁B₁ to be the reference group, let the first (second) dummy variable of parental genotype be D₁=1 (D₂=1) if the parental genotype is B₁B₂ (B₂B₂); otherwise D₁=0 (D₂=0). Similarly, let the first (second) dummy variable of offspring genotype be G₁=1 (G₂=1) if the offspring genotype is B₁B₂ (B₂B₂); otherwise G₁=0 (G₂=0). As a result, conditional on the affected offspring genotypes, the distribution of parental genotypes among triads can be compared to that of dyads by the logistic model

$\mathrm{logit}(p) = \beta_0 + \beta_{D_1} D_1 + \beta_{D_2} D_2 + \beta_{G_1} G_1 + \beta_{G_2} G_2$,

where $p = \Pr(Y = 1 \mid D_1, D_2, G_1, G_2)$ and $\mathrm{logit}(p) = \log[p/(1-p)]$.

Under the null hypothesis that the genotypes of ascertained triads and dyads are MCAR, the null hypothesis of TIM is $\beta_{D_1} = \beta_{D_2} = 0$ ∣ G₁, G₂, which states that the distributions of parental genotypes are identical among triads and dyads controlling for offspring genotypes.
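The test just described can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the authors' implementation: the article does not say which test (Wald, score or likelihood ratio) is applied to the two parental-genotype coefficients, so a 2-d.f. likelihood-ratio test is assumed here. The function names and all counts are hypothetical. Genotypes are coded 0, 1, 2 for B₁B₁, B₁B₂, B₂B₂, and each cell holds the number of parental observations from triads and from dyads.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def max_loglik(rows):
    """Maximised logistic log-likelihood via Newton-Raphson.

    rows: list of (covariate vector, n parents from triads [Y=1], n parents from dyads [Y=0]).
    """
    p = len(rows[0][0])
    beta = [0.0] * p
    for _ in range(100):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, n1, n0 in rows:
            mu = 1.0 / (1.0 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
            n = n1 + n0
            for a in range(p):
                grad[a] += (n1 - n * mu) * x[a]
                for c in range(p):
                    hess[a][c] += n * mu * (1.0 - mu) * x[a] * x[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    loglik = 0.0
    for x, n1, n0 in rows:
        eta = sum(b * xi for b, xi in zip(beta, x))
        loglik += n1 * eta - (n1 + n0) * math.log1p(math.exp(eta))
    return loglik

def tim_test(counts):
    """Likelihood-ratio test of beta_D1 = beta_D2 = 0 given offspring genotype.

    counts: {(offspring genotype, parental genotype): (n from triads, n from dyads)}.
    """
    full, null = [], []
    for (g, d), (n1, n0) in counts.items():
        G = [float(g == 1), float(g == 2)]
        D = [float(d == 1), float(d == 2)]
        full.append(([1.0] + D + G, n1, n0))
        null.append(([1.0] + G, n1, n0))
    lr = max(2.0 * (max_loglik(full) - max_loglik(null)), 0.0)
    p_value = math.exp(-lr / 2.0)  # chi-square upper tail with 2 d.f.
    return lr, p_value

# Hypothetical counts: the triad:dyad ratio is 2:1 for every parental genotype (MCAR-like).
mcar_counts = {(0, 0): (40, 20), (0, 1): (60, 30),
               (1, 0): (50, 25), (1, 1): (100, 50), (1, 2): (70, 35),
               (2, 1): (80, 40), (2, 2): (90, 45)}
# Hypothetical counts: B2B2 parents are over-represented among dyads (informative missingness).
informative_counts = dict(mcar_counts)
informative_counts[(1, 2)] = (35, 70)
informative_counts[(2, 2)] = (45, 90)

lr_mcar, p_mcar = tim_test(mcar_counts)
lr_inf, p_inf = tim_test(informative_counts)
print(f"MCAR-like data:   LR = {lr_mcar:.3f}, p = {p_mcar:.3f}")
print(f"Informative data: LR = {lr_inf:.3f}, p = {p_inf:.2e}")
```

Cells that are impossible under Mendelian inheritance, such as a B₂B₂ parent of a B₁B₁ offspring, are simply absent from the count table; the main-effects model remains identifiable from the seven remaining cells.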

Scenarios under missing at random

Little and Rubin¹² indicated that MCAR means that the cause of missingness is unrelated to the items and that the observed values form a random subsample of the sampled values. Missing at random (MAR) means that the probability of a missing value for an outcome depends on the observed responses of other covariates, but given these, it does not depend on the missing value itself. Within subgroups formed by the observed covariates on which the missingness depends, the data are MCAR. Therefore, scenarios under MAR are also considered as the null distribution, since the data become MCAR after adjusting for available covariates related to the missing data mechanism.

For example, suppose you are modeling weight (Y) as a function of sex (X). Some respondents would not disclose their weight, so you are missing some values for Y. One sex may be less likely to disclose its weight. That is, the probability that Y is missing depends only on the value of X. Such data are MAR and weight conditional on sex (Y∣X) is MCAR. Therefore, the data can be considered as MCAR within subgroups formed by the observed items (covariates) on which the missingness depends. Here, let the covariates be X₁, X₂,…,X_K and the logistic model is

$\mathrm{logit}(p) = \beta_0 + \beta_{D_1} D_1 + \beta_{D_2} D_2 + \beta_{G_1} G_1 + \beta_{G_2} G_2 + \sum_{k=1}^{K} \beta_{X_k} X_k$.

The null hypothesis of MCAR is $\beta_{D_1} = \beta_{D_2} = 0$ ∣ G₁, G₂, X₁, X₂,…, X_K. Therefore, even when the missing data mechanism is MAR but not MCAR, the TIM does not reject the null hypothesis when covariates related to missingness X₁, X₂,…,X_K are taken into account.
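The weight and sex example can be illustrated with a small simulation. All numerical values below (mean weights, standard deviation, missing rates, sample size, seed) are hypothetical and chosen only to make the effect visible. The sketch shows that the observed weights are a biased sample marginally but behave like a random subsample within each sex.

```python
import random

def simulate_mar(n=200_000, seed=1):
    """Simulate weight Y given sex X, with missingness depending on X only (MAR)."""
    rng = random.Random(seed)
    full = {0: [], 1: []}      # all weights, by sex
    observed = {0: [], 1: []}  # disclosed weights only, by sex
    for _ in range(n):
        x = rng.randint(0, 1)                 # sex, coded 0 or 1
        y = rng.gauss(60.0 + 20.0 * x, 10.0)  # weight depends on sex
        full[x].append(y)
        p_missing = 0.6 if x == 1 else 0.1    # missingness depends on X, not on Y
        if rng.random() >= p_missing:
            observed[x].append(y)
    return full, observed

def mean(values):
    return sum(values) / len(values)

full, observed = simulate_mar()

# Marginally, the observed mean is biased because one sex is under-represented.
marginal_gap = mean(full[0] + full[1]) - mean(observed[0] + observed[1])
# Within each sex, observed values are a random subsample, so the means agree.
within_gap = {x: mean(full[x]) - mean(observed[x]) for x in (0, 1)}

print(f"marginal gap: {marginal_gap:.2f}")
print(f"within-sex gaps: {within_gap[0]:.2f}, {within_gap[1]:.2f}")
```

Conditioning on X plays the same role here as adding X₁, X₂,…,X_K to the logistic model: once the covariate driving missingness is accounted for, nothing remains for the test to detect.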

Simulations

We first assumed that the population is free from population stratification. Let ‘a’ and ‘A’ denote the disease and normal allele. Let D denote that an individual is diseased. Let f denote the probability of being affected when an individual carries 0 risk alleles (the phenocopy rate), and let K denote the genotype relative risk (GRR). For a recessive disease model, the penetrance functions are P(D∣AA)=P(D∣Aa)=f and P(D∣aa)=K × f, where 0≤f≤1 and 0≤K × f≤1. The disease prevalence is determined by these probabilities and the risk allele frequency. Similarly, for a dominant disease model, P(D∣AA)=f and P(D∣Aa)=P(D∣aa)=K × f. For an additive disease model, P(D∣AA)=f; P(D∣Aa)=K × f; P(D∣aa)=(2 × K−1) × f (K>0.5). We considered additive, recessive, and dominant disease models in our simulations and the affection status of each individual is determined according to these parameters.
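The three penetrance models can be sketched as follows. Two points are assumptions rather than statements from the text: the additive penetrance for genotype aa is read as (2K−1)×f, so that each risk allele adds (K−1)×f, and the prevalence is computed under Hardy-Weinberg proportions. The function names are hypothetical.

```python
def penetrances(model, f, K):
    """Return (P(D|AA), P(D|Aa), P(D|aa)) for phenocopy rate f and genotype relative risk K."""
    if model == "recessive":
        pen = (f, f, K * f)
    elif model == "dominant":
        pen = (f, K * f, K * f)
    elif model == "additive":
        # Assumed reading: each risk allele adds (K - 1) * f to the baseline f.
        pen = (f, K * f, (2 * K - 1) * f)
    else:
        raise ValueError(f"unknown disease model: {model}")
    if not all(0.0 <= p <= 1.0 for p in pen):
        raise ValueError("penetrances must lie in [0, 1]")
    return pen

def prevalence(pen, q):
    """Disease prevalence for risk allele frequency q, assuming Hardy-Weinberg proportions."""
    return (1 - q) ** 2 * pen[0] + 2 * q * (1 - q) * pen[1] + q ** 2 * pen[2]

# Phenocopy rate 0.2 and penetrance K * f = 0.4 (so K = 2), as used in the Results.
for model in ("recessive", "dominant", "additive"):
    pen = penetrances(model, f=0.2, K=2.0)
    print(model, [round(p, 3) for p in pen], round(prevalence(pen, q=0.3), 4))
```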

We simulated a general population where nuclear families have exactly one offspring. We randomly assigned each offspring, father, and mother to be missing according to various probabilities indicated in each table. Therefore, only a proportion of families with an affected offspring were eligible for the study and a total of 500 families were sampled.

Several disease allele (denoted a) and marker allele (denoted B₁) frequencies were examined. A range of possible values for the disequilibrium coefficient δ=p(aB₁)−p(a)p(B₁) and recombination fraction θ were simulated. The frequencies of the disease and marker alleles, the disease model, the phenocopy rate, and the penetrance are indicated in each table.

We repeated the simulation 10 000 (1000) times to examine type-I error (power) of several tests examined including the TIM. Under the null hypothesis of MCAR, the fraction of times that the test statistic exceeds the critical value, defined by the asymptotic distribution of the statistic, is the type-I error. The power of each test is the proportion of test statistics in the total number of simulations exceeding the critical value under the alternative hypothesis. Type-I error and power of several LD tests were also evaluated under the various patterns of parental genotype missingness.
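The empirical type-I error and power are both rejection rates over replicates. The sketch below computes such a rate together with its Monte Carlo standard error; the standard-error formula is the usual binomial one and is not taken from the article. The replicate statistics are hypothetical, and the critical value is the upper 5% point of the chi-square distribution with 2 d.f.

```python
import math

CHI2_CRIT_2DF = 5.991  # upper 5% point of chi-square with 2 d.f.

def rejection_rate(statistics, critical_value):
    """Fraction of replicates whose statistic exceeds the critical value, with its standard error."""
    n = len(statistics)
    rate = sum(s > critical_value for s in statistics) / n
    se = math.sqrt(rate * (1.0 - rate) / n)
    return rate, se

def mc_standard_error(p, n_replicates):
    """Binomial standard error of an estimated rejection probability p."""
    return math.sqrt(p * (1.0 - p) / n_replicates)

# Hypothetical statistics from 100 replicates: 5 exceed the critical value.
stats = [6.5] * 5 + [1.0] * 95
rate, se = rejection_rate(stats, CHI2_CRIT_2DF)
print(f"rejection rate = {rate:.3f} (SE {se:.3f})")

# Precision of a 5% rejection rate at the two replicate counts used here.
print(f"SE with 10 000 replicates: {mc_standard_error(0.05, 10_000):.4f}")
print(f"SE with 1000 replicates:   {mc_standard_error(0.05, 1_000):.4f}")
```

A rate near 5% is estimated roughly three times more precisely with 10 000 replicates than with 1000, which is consistent with using the larger number for type-I error.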

In a second set of simulations we introduced population stratification by sampling two populations with expected sample sizes reflecting different disease frequencies in the subpopulations. For example, for a pure recessive model, if the disease allele frequencies of the two populations are 0.3 and 0.6, respectively, then 9% of the first and 36% of the second population would be affected and sampled. Therefore, we expect 20% and 80% of the sample to come from the first and second populations, respectively. This is the ratio we would observe in most admixed samples. Because the disease allele frequencies are different in the two populations, the frequencies of diseased individuals in the two samples are also different.
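The arithmetic of this example can be checked directly. The sketch assumes a pure recessive model (only aa individuals are affected), Hardy-Weinberg proportions within each subpopulation, and subpopulations of equal size unless stated otherwise; the function name is hypothetical.

```python
def expected_sample_shares(allele_freqs, pop_sizes=None):
    """Expected share of the affected sample drawn from each subpopulation.

    allele_freqs: disease allele frequency in each subpopulation
    pop_sizes: relative subpopulation sizes (equal sizes assumed if omitted)
    """
    if pop_sizes is None:
        pop_sizes = [1.0] * len(allele_freqs)
    # Pure recessive model: the affected fraction is the aa genotype frequency q^2.
    affected = [n * q ** 2 for n, q in zip(pop_sizes, allele_freqs)]
    total = sum(affected)
    return [a / total for a in affected]

shares = expected_sample_shares([0.3, 0.6])
print([round(s, 3) for s in shares])  # affected fractions 0.09 and 0.36 give shares 0.2 and 0.8
```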

In Tables 3, 4, 5 and 6, the column marked ‘TDT’ reports results using the traditional TDT² test on the subset of complete triads only. The columns marked ‘1-TDT’ and ‘EM-HRR’ (expectation maximization algorithm based haplotype relative risk)⁷ use both the complete triads and dyads. The column marked ‘TIM’ is the test of informative missingness.

Table 3 Type-I error (%) of TIM at α=0.05 based on 10 000 replicates

Table 4 Type-I error (%) of TIM at α=0.05 under population admixture based on 10 000 replicates

Table 5 Power (%) of TIM at α=0.05 with no linkage (θ=0.5) and no association (δ=0) based on 1000 replicates

Table 6 Power (%) of TIM at α=0.05 with linkage (θ=0) and association (δ=0.05) based on 1000 replicates

Results

When genotypes of ascertained offspring and parents are MCAR, the type-I errors of TIM in a homogeneous population are displayed in Table 3. Both the disease and marker allele frequencies are 0.3. The disease penetrance and phenocopy rate are 0.4 and 0.2, respectively. Different disease and marker allele frequencies, penetrance, and phenocopy rates yielded similar results, not shown here. The underlying disease model (dominant, additive, or recessive) is indicated in the first column. The second and third columns are the recombination fraction θ and disequilibrium coefficient δ. The three missing rates for fathers, mothers, and offspring are displayed in the first, second, and third number of the parenthesis in the fourth column. The three missing rates may differ. However, each of the three missing rates is identical for all genotypes B₁B₁, B₁B₂, and B₂B₂, such that the missing patterns are considered as MCAR. When there is no linkage (θ=0.5) and no association (δ=0), the TDT, 1-TDT, and EM-HRR have expected 5% chance of rejecting the null hypothesis. When there is linkage (θ=0) and association (δ=0.05), the power of TDT, 1-TDT, and EM-HRR are displayed in the bottom rows of Table 3. One can see that TIM has expected 5% type-I error regardless of the relationship between the marker and the disease alleles (independent of values of θ and δ).

When genotypes of ascertained offspring and parents are MCAR in an admixed population, the type-I error of TIM is displayed in Table 4. The disease allele (minor marker allele) frequencies for the first and second populations are 0.2 and 0.6 (0.4 and 0.3), respectively. The disease penetrance and phenocopy rates are 0.4 and 0.2, respectively. Since TDT, 1-TDT, and EM-HRR are robust to population stratification, all tests have expected 5% error rates when there is no linkage (θ=0.5) and no association (δ=0). Under the alternative hypothesis with linkage (θ=0) and association (δ=0.05), TDT has the lowest power due to the exclusion of dyads in the analysis. Therefore, the 1-TDT is more powerful than TDT and EM-HRR has the highest power for detecting linkage and association, matching previous reports.^{4, 7} Since TIM has expected 5% type-I error in the extreme scenarios we simulated, TIM is also robust to population admixture.

Simulation results displayed in Tables 5 and 6 are circumstances under which genotypes of trios are missing informatively in a homogeneous population. The disease and marker allele frequencies are 0.3 and 0.4, respectively. The disease penetrance and phenocopy rate are 0.4 and 0.2, respectively. Within each disease model, we first display four scenarios where informative missingness of genotypes occurred in parents only and genotypes of ascertained offspring are MCAR (20% missing rates for all genotypes). Secondly, we introduced informative missingness for genotypes of ascertained offspring as well as their parents and the results are displayed in rows 5–8 within each disease model.

In Table 5, when there is no association (δ=0) and no linkage (θ=0.5) and offspring genotypes are MCAR (rows 1–4 in each disease model), the TDT using the subset of complete triads remains a valid test for linkage and association with expected 5% type-I error. The 1-TDT and EM-HRR, using both triads and dyads, had inflated type-I errors and the inflation increased with respect to magnitude and pattern of informative missingness. TIM has better power to detect informative missingness in the more frequent parental genotypes (B₂B₂) than the less frequent genotypes (B₁B₁). When offspring genotypes are missing informatively (rows 5–8 in each disease model), by excluding dyads from the analysis, TDT is no longer valid for testing linkage and association. However, incorporation of dyads and monads reduced such biases.¹⁰ The simulation results suggest that the TIM maintains good power when offspring genotypes are also missing informatively.

In Table 6, when there was association (δ=0.05) and linkage (θ=0), the power of the 1-TDT and EM-HRR can be lower or higher than TDT, suggesting that incorporating dyads either dampened or inflated the power of those tests when the MCAR assumption was violated, matching the investigations by Guo et al.^{10, 13} In the scenarios we examined, the TIM has decent power to detect informative missingness and its performance is closely related to the missing data pattern. Power of TIM was not confounded by linkage and association between the disease locus and the marker.

Application to the Framingham Heart Study

The Framingham Heart Study began in 1948 with the enrollment of 5209 men and women.^{14, 15} In 1971, 5124 men and women were enrolled into the Framingham Offspring Study, which included the offspring (and their spouses) of the original cohort. Offspring participants underwent examinations approximately every 4 years; the design and methodology have been previously described.^{16, 17} The sample analyzed was comprised of Framingham Offspring Study participants who attended the sixth examination cycle between 1995 and 1998 and the apolipoprotein E (apoE) genotypes of the first generation cohort. The Framingham Heart Study protocol is approved by the Boston Medical Center Institutional Review Board and all participants provided written informed consent.

Ordovas et al¹⁸ reported evidence for association of the apoE isoform with elevated total cholesterol (TC) levels in the Framingham Heart Study. Jarvik et al¹⁹ addressed the possible influence of apoE genotype on age-related changes in TC from a male twin longitudinal study. Several studies of unrelated subjects also reported association between the apoE gene and TC.^{20, 21, 22, 23, 24}

Because genotypes of the first generation cohort were collected nearly 40 years after the initialization of the study, it has been questioned whether the missing data pattern of parental genotypes of the Framingham Heart Study was affected by potential survival bias. Therefore, we applied TIM to the relation of elevated total cholesterol and APOE genotype. The apoE gene has three common alleles, apoE2, apoE3, and apoE4; the genotype frequencies are 0, 0.089, 0.021, 0.642, 0.223, and 0.025 for E2/E2, E2/E3, E2/E4, E3/E3, E3/E4, and E4/E4, respectively. We adopted an approach similar to that of Guo et al²⁵ and combined the rare allele apoE2 (associated with low TC) with the major allele apoE3 to compare those with at least one apoE4 allele (associated with high TC) to those without any. Of the 3532 participants attending the sixth offspring examination, there were 1041 individuals with at least one parental APOE genotype. Among these, 472 had elevated total cholesterol (greater than 200 mg per 100 ml), deriving from 427 independent nuclear families. In total, 229 dyads and 198 triads were included in the analysis.

In Table 7, we display the distribution of parental genotypes among dyads and triads by offspring genotypes. The logistic regression of missing status yielded a P-value of 0.8624 for parental genotype adjusting for offspring genotype. Therefore, there was no statistically significant evidence of informative missingness of parental APOE genotypes in the families with offspring who have elevated total cholesterol at the Framingham Heart Study.

Table 7 Distribution of APOE genotypes

Discussion

For the case–parent triads design, the TDT² cannot include families with incomplete parental genotypes. Approaches such as the 1-TDT⁴ and EM-HRR⁷ were designed to include such families due to missingness related to a disease in ascertainment, not to genotyping failure. They may be more powerful than the TDT² but are valid only if the missingness is not informative, that is, missingness independent of the underlying genotype (MAR). Although approaches proposed by Allen et al⁸ and Chen⁹ can include incomplete triads and are valid under informative missingness, they may not be as powerful as 1-TDT⁴ and EM-HRR⁷ when the missing data pattern is truly MCAR. Regardless of different missing data patterns among parental genotypes, the above methods assumed that offspring genotypes were MCAR. Recently, Guo¹⁰ indicated that when offspring genotypes were missing informatively, a circumstance that can arise from ascertainment bias, inflated type-I error and/or reduced power may occur using the TDT excluding incomplete triads.

The purpose of this work is to provide a test for informative missingness in the context of case–parent triad designs for genetic linkage and/or association studies, in an effort to avoid a biased conclusion. Our approach compares the parental genotype distribution in triads to that of dyads conditional on the genotypes of affected offspring. Differential parental genotype distributions in triads and dyads indicate that parental genotypes are missing informatively. We have shown, through theoretical derivations and computer simulations, that TIM is not affected by linkage (θ) or association (δ). It provides the expected 5% type-I error at the α=0.05 level under MCAR and is robust to population admixture. Simulation results suggest that TIM has adequate power to test informative missingness in moderately sized samples. In the logistic regression framework, TIM remains a valid test under MAR by conditioning on available covariates X₁, X₂,…,X_K related to missingness.

Given a significant TIM result, assuming that informative missingness exists only in parental genotypes due to, for example, a late onset fatal disease such as cardiovascular disease, Allen et al⁸ and Chen's⁹ strategies are recommended to incorporate dyads. Otherwise, the 1-TDT⁴ and EM-HRR⁷ are appropriate and may provide higher power. However, one should be aware of the basic assumption of absence of ascertainment bias in all TDT/family-based association tests. If TIM is significant for an early onset fatal disease, one should be aware that none of existing methods yields a valid result, including the TDT with only complete trios, as illustrated and discussed by Guo.¹⁰

The proposed test TIM is developed for case–parent triad designs. However, it is readily applicable to other designs consisting of parents and affected offspring by selecting triads and dyads from the data, although this may not be the most powerful approach because sibling information is discarded. Therefore, our future work will extend TIM to consider more general pedigrees. Many recently published genome-wide association studies (GWAs) are case–control designs, but there are also family-based GWAs with available genotyping on parent–offspring trios, such as the Framingham Heart Study and International Multi-Center ADHD Genetics Project (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap). It is likely that more family-based GWA studies using previously collected DNA will emerge in the near future. The advantage of the parent–offspring design over the case–control design is that the former is immune to population admixture and tests for linkage as well as association. Therefore, our method may be useful for these designs to further reduce false positives due to informative missingness in the modern era of genome-wide studies.

References

1. Falk CT, Rubinstein P : Haplotype relative risks: an easy reliable way to construct a proper control sample for risk calculations. Ann Hum Genet 1987; 51: 227–233.
2. Spielman RS, McGinnis RE, Ewens WJ : Transmission test for linkage disequilibrium: the insulin gene region and insulin dependent diabetes mellitus. Am J Hum Genet 1993; 52: 506–516.
3. Curtis DR, Sham PC : A note on the application of the transmission disequilibrium test when a parent is missing. Am J Hum Genet 1995; 56: 811–812.
4. Sun F, Flanders W, Yang Q, Khoury J : Transmission disequilibrium test (TDT) with only one parent is available: the 1-TDT. Am J Epidemiol 1999; 150: 97–104.
5. Clayton D : A generalization of the transmission/disequilibrium test for uncertain haplotype transmission. Am J Hum Genet 1999; 65: 1170–1177.
6. Weinberg CR : Allowing for missing parents in genetic studies of case-parents triads. Am J Hum Genet 1999; 64: 1186–1193.
7. Guo CY, Destefano AL, Lunetta KL, Dupuis J, Cupples LA : Expectation maximization algorithm based haplotype relative risk (EM-HRR): test of linkage disequilibrium using incomplete case-parents trios. Hum Hered 2005; 59: 125–135.
8. Allen AS, Rathouz PJ, Satten GA : Informative missingness in genetic association studies: case-parent designs. Am J Hum Genet 2003; 72: 671–680.
9. Chen YH : New approach to association testing in case-parent designs under informative parental missingness. Genet Epidemiol 2004; 27: 131–140.
10. Guo CY : Validity of the transmission/disequilibrium test (TDT) under impact of complex informative missingness. BMC Proc 2007; 1 (Suppl 1): S26.
11. Ott J : Statistical properties of the haplotype relative risk. Genet Epidemiol 1989; 6: 127–130.
12. Little RJA, Rubin DB : Statistical Analysis with Missing Data. New York: Wiley, 1987.
13. Guo CY, Cui J, Cupples LA : Impact of non-ignorable missingness on genetic tests of linkage and/or association using case-parents trios. BMC Genet 2005; 6 (Suppl 1): S90.
14. Dawber TR, Meadors GF, Moore FE : Epidemiologic approaches to heart disease: the Framingham study. Am J Public Health 1951; 41: 279–286.
15. Dawber TR, Kannel WB, Lyell LP : An approach to longitudinal studies in a community: the Framingham heart study. Ann NY Acad Sci 1963; 107: 539–556.
16. Feinleib M, Kannel WB, Garrison RJ, McNamara PM, Castelli WP : The Framingham offspring study. Design and preliminary data. Prev Med 1975; 4: 518–525.
17. Kannel WB, Feinleib M, McNamara PM, Garrison RJ, Castelli WP : An investigation of coronary heart disease in families. The Framingham offspring study. Am J Epidemiol 1979; 110: 281–290.
18. Ordovas JM, Litwack-Klein L, Wilson PW, Schaefer MM, Schaefer EJ : Apolipoprotein E isoform phenotyping methodology and population frequency with identification of apoE1 and apoE5 isoforms. J Lipid Res 1987; 28: 371–380.
19. Jarvik GP, Austin MA, Fabsitz RR et al: Genetic influences on age-related change in total cholesterol, low density lipoprotein-cholesterol, and triglyceride levels: longitudinal apolipoprotein E genotype effects. Genet Epidemiol 1994; 11: 375–384.
20. Jarvik GP, Goode EL, Austin MA et al: Evidence that the apolipoprotein E-genotype effects on lipid levels can change with age in males: a longitudinal analysis. Am J Hum Genet 1997; 61: 171–181.
21. Kallio MJ, Salmenpera L, Siimes MA, Perheentupa J, Gylling H, Miettinen TA : The apolipoprotein E phenotype has a strong influence on tracking of serum cholesterol and lipoprotein levels in children: a follow-up study from birth to the age of 11 years. Pediatr Res 1998; 43: 381–385.
22. Fulton JE, Dai S, Grunbaum JA, Boerwinkle E, Labarthe DR : Apolipoprotein E affects serial changes in total and low-density lipoprotein cholesterol in adolescent girls: Project HeartBeat!. Metabolism 1999; 48: 285–290.
23. Srinivasan SR, Ehnholm C, Elkasabany A, Berenson G : Influence of apolipoprotein E polymorphism on serum lipids and lipoprotein changes from childhood to adulthood: the Bogalusa Heart Study. Atherosclerosis 1999; 143: 435–443.
24. Hak AE, Witteman JC, Hugens W et al: The increase in cholesterol with menopause is associated with the apolipoprotein E genotype. A population-based longitudinal study. Atherosclerosis 2004; 175: 169–176.
25. Guo CY, Lunetta KL, DeStefano AL, Ordovas JM, Cupples LA : Informative transmission disequilibrium test (i-TDT): combined linkage and association mapping that includes unaffected offspring as well as affected offspring. Genet Epidemiol 2007; 31: 115–133.

Acknowledgements

This work was supported in part by the National Heart, Lung and Blood Institute's Framingham Heart Study (Contract No. N01-HC-25195). We acknowledge the support of the Genomics Program at Children's Hospital Boston. We thank anonymous reviewers and the editor for their insightful comments and suggestions.

Author information

Authors and Affiliations

Clinical Research Program, Children's Hospital Boston, Boston, MA, USA
Chao-Yu Guo
Department of Medicine, Program in Genomics, Children's Hospital Boston, Boston, MA, USA
Chao-Yu Guo
Department of Pediatrics, Harvard Medical School, Boston, MA, USA
Chao-Yu Guo
Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
Laura Adrienne Cupples & Qiong Yang

Corresponding author

Correspondence to Chao-Yu Guo.

Appendix 1

Distribution of ascertained triads and dyads

First, we assume that the data consists of genotypes of bi-allelic markers such as a single nucleotide polymorphism. Therefore, there are exactly two alleles, B₁ and B₂, at the marker locus. We first derive the distribution of complete triads as the following: Let G_o, G_f, G_m be the offspring's, father's, and mother's genotypes, respectively. Let G_of and G_om be the allele of offspring inherited from the father and mother, respectively. Then, G_o, when it is heterozygous, really represents a set of two possible pairs of values, (G_of=B₁, G_om=B₂) or (G_of=B₂, G_om=B₁). Let I_f, I_m, and I_o be binary indicator functions for father, mother, and offspring having missing genotype information. For example, I_f=1 if the father's genotype is missing and 0 otherwise.

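Here, we do not consider imprinting. The four possible joint probabilities of a given parental genotype and of transmitting a given allele to the offspring from that parent, all conditional on the offspring being affected, are (shown for the father; the same probabilities apply to the mother with G_m and G_om):

μ = Pr(G_f=(B₁B₁); G_of=B₁∣affected offspring),
ν = Pr(G_f=(B₁B₂); G_of=B₁∣affected offspring),
ζ = Pr(G_f=(B₁B₂); G_of=B₂∣affected offspring),
τ = Pr(G_f=(B₂B₂); G_of=B₂∣affected offspring).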

When the disease model is recessive, Ott¹¹ (Table 2) showed that μ=(s+δ/r)s, ν=(s+δ/r) (1−s)−θδ/r, ζ=(1−s−δ/r)s+θδ/r and τ=(1−s−δ/r) (1−s), where ‘r’ is the allele frequency of the recessive disease allele ‘a’, and ‘s’ is the allele frequency of marker allele ‘B₁’. The parameter θ denotes the recombination fraction, and δ=p(aB₁)−p(a)p(B₁) denotes the disequilibrium coefficient between the marker and the disease locus. The conditional probabilities under the dominant or additive disease model can be derived similarly.

Assuming random mating and no missing parental genotype in the population, the probability of ascertaining a triad with the father, mother, and affected offspring's genotypes being B₁B₁, B₁B₂, and B₁B₂, respectively, is Pr(G_f=(B₁B₁); G_m=(B₁B₂); G_o=(B₁B₂)∣affected offspring)=μ × ζ.

However, it is unrealistic to assume the completeness of parental genotypes when collecting samples. For example, parental genotypes may not be available due to death from the disease under study (ie, the missing pattern of parental genotypes is related to the disease under study, or informative missingness) or due to random refusal of participation (MCAR). Allowing for differential missing rates for offspring, fathers, and mothers, let P_o11, P_o12, and P_o22 denote missing rates for offspring with B₁B₁, B₁B₂, and B₂B₂ genotypes, respectively. Similarly, let P_f11, P_f12, and P_f22 (P_m11, P_m12, and P_m22) denote missing rates for father (mother) with B₁B₁, B₁B₂, and B₂B₂ genotypes, respectively. Note that we do not assume any pattern for the nine missingness parameters, ie, missingness of a given parent's genotype can be dependent or independent of the other parent's and/or offspring's genotype.

Taking the above missingness parameters into consideration, the conditional probability of ascertaining a complete triad with the father, mother, and affected offspring's genotypes being B₁B₁, B₁B₂, and B₁B₂, respectively, is Pr(I_f=0&G_f=(B₁B₁); I_m=0&G_m=(B₁B₂); I_o=0&G_o=(B₁B₂)∣affected offspring)=μ × ζ × (1−P_f11) × (1−P_m12) × (1−P_o12). The remaining probabilities can be derived in the same manner and are displayed in Table 1.
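The recessive-model expressions for μ, ν, ζ and τ and the ascertainment probability of a complete triad can be evaluated numerically, as in the following sketch. The function names and the 20% missing rates are hypothetical; the allele frequencies, δ and θ are taken from the simulation settings reported in the Results.

```python
def ott_recessive(r, s, delta, theta):
    """Return (mu, nu, zeta, tau) under a recessive disease model.

    r: frequency of the disease allele a
    s: frequency of the marker allele B1
    delta: disequilibrium coefficient p(aB1) - p(a)p(B1)
    theta: recombination fraction
    """
    t = s + delta / r  # shorthand for the recurring term s + delta/r
    mu = t * s
    nu = t * (1 - s) - theta * delta / r
    zeta = (1 - t) * s + theta * delta / r
    tau = (1 - t) * (1 - s)
    return mu, nu, zeta, tau

def complete_triad_prob(mu, zeta, p_f11, p_m12, p_o12):
    """Pr(father B1B1, mother B1B2, offspring B1B2, all three genotyped | affected offspring)."""
    return mu * zeta * (1 - p_f11) * (1 - p_m12) * (1 - p_o12)

mu, nu, zeta, tau = ott_recessive(r=0.3, s=0.4, delta=0.05, theta=0.0)
total = mu + nu + zeta + tau  # the four probabilities sum to 1
prob = complete_triad_prob(mu, zeta, p_f11=0.2, p_m12=0.2, p_o12=0.2)
print(f"mu={mu:.4f} nu={nu:.4f} zeta={zeta:.4f} tau={tau:.4f} sum={total:.4f}")
print(f"Pr(complete triad B1B1, B1B2, B1B2) = {prob:.5f}")
```

With δ=0 the four values reduce to s², s(1−s), (1−s)s and (1−s)², so the marker carries no information about the disease locus.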

Null distribution of TIM under MCAR

If genotypes are MCAR, then the probability of missing a subject is independent of the subject's genotype, ie, P_o11=P_o12=P_o22=P_o, P_f11=P_f12=P_f22=P_f, P_m11=P_m12=P_m22=P_m. Note that P_o, P_f, and P_m are three parameters and need not to be identical. In addition, and . Under the null hypothesis of MCAR and conditional on genotypes of affected offspring B₁B₁, the proportion of parents with B₁B₁ genotypes among triads

is equivalent to that of dyads

which are both equivalent to μ/(μ+ν). The remaining conditional null distributions can be derived in a similar manner.
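The equality can be verified with a short worked derivation. It is a sketch in terms of μ and ν rather than the Table 1 entries, and it assumes, as in the triad probability above, that the two parents contribute independent factors. It is shown for the father; the mother's case is symmetric. An affected offspring with genotype B₁B₁ received B₁ from each parent, so each parent is either B₁B₁ (probability μ) or B₁B₂ (probability ν).

```latex
\begin{align*}
\Pr(G_f = B_1B_1 \mid \text{triad},\ G_o = B_1B_1)
  &= \frac{\mu(\mu+\nu)\,(1-P_f)(1-P_m)(1-P_o)}
          {(\mu+\nu)^2\,(1-P_f)(1-P_m)(1-P_o)}
   = \frac{\mu}{\mu+\nu}, \\[4pt]
\Pr(G_f = B_1B_1 \mid \text{dyad with mother missing},\ G_o = B_1B_1)
  &= \frac{\mu(\mu+\nu)\,(1-P_f)\,P_m\,(1-P_o)}
          {(\mu+\nu)^2\,(1-P_f)\,P_m\,(1-P_o)}
   = \frac{\mu}{\mu+\nu}.
\end{align*}
```

Under MCAR the missing rates do not depend on genotype, so they cancel between numerator and denominator. If, say, P_f differed between B₁B₁ and B₁B₂ fathers, the cancellation would fail and the two proportions would no longer agree, which is the departure TIM is designed to detect.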

About this article

Cite this article

Guo, CY., Cupples, L. & Yang, Q. Testing informative missingness in genetic studies using case–parent triads. Eur J Hum Genet 16, 992–1001 (2008). https://doi.org/10.1038/ejhg.2008.38

Received: 07 April 2007
Revised: 23 October 2007
Accepted: 01 February 2008
Published: 12 March 2008
Issue Date: August 2008
DOI: https://doi.org/10.1038/ejhg.2008.38
