A novel test of informative missingness using inconsistent linkage disequilibrium signals between case-parent triads and incomplete data

Guo, Chao-Yu

doi:10.1038/jhg.2012.78

Original Article
Published: 28 June 2012

Chao-Yu Guo^1,2,3

Journal of Human Genetics volume 57, pages 601–609 (2012)

Abstract

Before genetic data are analyzed, several quality issues, such as departures from Hardy–Weinberg equilibrium and Mendelian errors, are routinely examined. Although missing genotypes are commonly observed in genetic studies, potential bias due to informative missingness is usually overlooked. The Test of Informative Missingness (TIM) was the first attempt to determine whether or not parental genotypes are missing informatively. The TIM is a useful tool for genetic data cleaning. For example, excluding single-nucleotide polymorphisms that appear to be missing informatively may further improve the quality of genetic data. Although the TIM has decent power, its performance is discernibly weaker when the minor allele/genotype introduces informative missingness. To avoid this loss of power, the newly proposed strategy detects informative missingness by comparing linkage disequilibrium signals between intact case-parent triads and incomplete data and testing whether they are inconsistent. Computer simulations revealed that the new method was robust to population stratification and more powerful than the TIM in most situations. In addition, the new method demonstrated decent power in the genome-wide association study setting, even when the most conservative correction for multiple testing was adopted.

Introduction

Spurious associations due to population admixture could be a serious issue in genetic studies using unrelated subjects. To avoid false signals, the family-based approach, haplotype relative risk (HRR),¹ utilizes case-parent triads to detect linkage disequilibrium (LD) between a marker and a putative disease locus by comparing parental marker alleles transmitted to an affected offspring to those non-transmitted. Instead of treating transmitted and non-transmitted alleles as unrelated, the Transmission/Disequilibrium Test (TDT)² considered case-parent triads as matched data and examined whether or not heterozygous parents preferentially transmitted the specific allele to the affected offspring. The TDT is more powerful than the HRR, especially when population admixture is present. Therefore, the TDT is a popular study design for early onset diseases.
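The contrast between the two designs can be made concrete with a minimal sketch. It assumes the standard McNemar-type form of the TDT statistic and a plain 2 × 2 table of transmitted versus non-transmitted alleles for the HRR; the function names and the counts in the usage lines are hypothetical and are not taken from the article.

```python
def tdt_statistic(b: float, c: float) -> float:
    """McNemar-type TDT chi-square (1 df).

    b: heterozygous parents who transmitted allele B1 to the affected offspring
    c: heterozygous parents who transmitted allele B2 instead
    """
    if b + c == 0:
        raise ValueError("no heterozygous parents, TDT is undefined")
    return (b - c) ** 2 / (b + c)


def hrr_odds_ratio(t_b1: float, n_b1: float, t_b2: float, n_b2: float) -> float:
    """Odds ratio of the HRR-style 2 x 2 table.

    Rows: transmitted / non-transmitted parental alleles.
    Columns: allele B1 / allele B2.
    """
    return (t_b1 * n_b2) / (n_b1 * t_b2)


# Hypothetical counts, for illustration only.
chi_square = tdt_statistic(60, 40)
odds_ratio = hrr_odds_ratio(120, 80, 80, 120)
```

The TDT uses only heterozygous parents as matched pairs, whereas the HRR table pools all transmitted and non-transmitted alleles as if they were unrelated.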

The greatest challenge in recruiting case-parent triads is that one or both parental genotypes may be unavailable due to declined participation, death, or other unexpected reasons. In the statistical analysis, both missing completely at random (MCAR) and missing at random (MAR) are ignorable.³ If the events that lead to any particular value being missing are independent of both observed and unobserved parameters of interest, then the missing pattern is considered as MCAR. Given the observed data, if the missing mechanism does not depend on the unobserved data, then the missing pattern is MAR. The scenarios of MCAR and MAR could be confusing in the settings of a genetic study. For a single nucleotide polymorphism (SNP) with alleles A and C, there are three genotypes AA, AC and CC. In an admixed population, if the missing rates of the three genotypes are identical in all sub-populations, then the missing pattern is MCAR. If the missing rates of the three genotypes are identical within each subgroup, but the missing rates differ across subgroups, then the missing pattern appears to be MAR. Two distinct types of missingness in genotype data should be noted, because they arise from different mechanisms. The first situation is that individuals may be unavailable due to death or non-participation. Therefore, there can be different missing rates for the offspring and their parents. As a result, informative missingness could occur solely in the parents, but not the offspring. The second situation is that the genotyping assay may have failed to deliver a ‘call’ at a particular locus for a particular specimen, even though the person was participating. This failure may depend on the true genotype (hence be informative), but may not differ across individuals. As a result, informative missingness would exist in both the offspring and parents.
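The genotype-based definitions above can be expressed as a small rule. The sketch below is only an illustration of those definitions: it assumes the true per-genotype missing rates are known in each sub-population (in practice they are not), and the function name and the rates in the usage lines are hypothetical.

```python
def classify_missingness(rates):
    """Classify a missing pattern from per-genotype missing rates.

    rates: {sub_population: {genotype: missing_rate}}
    Returns "MCAR", "MAR" or "informative" following the paragraph's definitions.
    """
    # Within every sub-population, all genotypes must share one missing rate.
    for genotype_rates in rates.values():
        if len(set(genotype_rates.values())) != 1:
            return "informative"
    # One common rate across sub-populations is MCAR; differing rates are MAR.
    population_rates = {next(iter(g.values())) for g in rates.values()}
    return "MCAR" if len(population_rates) == 1 else "MAR"


# Hypothetical rates, for illustration only.
mcar = {"pop1": {"AA": 0.1, "AC": 0.1, "CC": 0.1},
        "pop2": {"AA": 0.1, "AC": 0.1, "CC": 0.1}}
mar = {"pop1": {"AA": 0.1, "AC": 0.1, "CC": 0.1},
       "pop2": {"AA": 0.3, "AC": 0.3, "CC": 0.3}}
informative = {"pop1": {"AA": 0.1, "AC": 0.01, "CC": 0.01}}
labels = [classify_missingness(r) for r in (mcar, mar, informative)]
```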

In 1995, the estimated probability of transmission of certain alleles⁴ was pointed out to be biased in the TDT using dyads (the affected offspring with only one parent), where only heterozygous parents and homozygous offspring contributed to the test. The 1-TDT⁵ was free from such bias when parental genotypes were MCAR or MAR. In addition to the 1-TDT, the family-based association test by Rabinowitz and Laird⁶ as well as several other strategies had been proposed^{7, 8, 9, 10, 11} to accommodate incomplete triads. When parental genotypes were missing informatively, Allen et al.¹² and Chen¹³ carried out valid tests to incorporate incomplete data. However, the two methods experienced substantially reduced statistical power when the underlying missing pattern was truly MCAR/MAR as discussed by Guo et al.¹⁴ Although various scenarios had been well studied, all methods^{5, 6, 7, 8, 9, 10, 11, 12, 13, 14} focused on the missing pattern of parental genotypes and assumed that the offspring genotypes were MCAR/MAR. When the assumption of MCAR/MAR was violated among offspring genotypes, Guo¹⁵ indicated that the TDT using only complete triads may still inflate the type-I error and/or reduce power due to ascertainment bias. This phenomenon suggests that if the missing pattern of offspring genotypes is not determined, a significant result of the TDT may not assure a true association, even if incomplete triads are excluded from the analysis. Therefore, missing mechanism is an important issue in analyzing genetic data.

The first attempt to determine whether or not parental genotypes are missing informatively was introduced by Guo et al.,¹⁴ the Test of Informative Missingness (TIM), which compared the distribution of parental genotypes in triads with that of dyads, conditional on the genotypes of affected offspring. Differential distributions of parental genotypes in triads and dyads indicated that the missing pattern of parental genotypes was not ignorable. The TIM is a valuable tool for genetic data cleaning. A novel application for the TIM was to exclude SNPs that are missing informatively in a genome-wide association study (GWAS). In this way, fewer yet more reliable SNPs will be analyzed and this procedure may effectively reduce the excessive amount of false positives in the analysis. In the era of GWAS, one million SNPs are considered as the standard. SNPs with missing rates exceeding a specific threshold are now routinely excluded, because inclusions of SNPs with higher missing rates lead to too many significant results, which are thought to be false positives. This strict enforcement of nearly complete data raises some important issues, since excessive rates of significant results in a typical GWAS may be potentially caused by informative missingness.

Although the TIM demonstrates decent power, its performance is discernibly weaker when the minor allele/genotype introduces informative missingness. The following example illustrates why power is reduced. Assume that the minor (major) allele is A (C) and the corresponding allele frequency is 0.3 (0.7). The frequencies of genotypes AA, AC and CC are then 0.09, 0.42 and 0.49, respectively. If 10% of the subjects with the genotype AA are missing, but only 1% of the individuals with the genotype AC or CC are not available, then the missing pattern is informative. In a random sample of 10 000 subjects, one would expect that 90, 42 and 49 subjects are missing genotypes AA, AC and CC, respectively. In contrast, if the excessive missingness (10%) occurs in the major genotype CC, but only 1% of the individuals with the genotype AA or AC are absent, then one would expect 9, 42 and 490 missing subjects, respectively. This much larger number of individuals with missing genotypes results in a stronger signal for informative missingness. Since the TIM is conditional on the offspring genotypes, the size of each of the three offspring genotype groups has an important role when comparing the distribution of parental genotypes in triads with that of dyads. It is worth noting that offspring with the genotype AA form the smallest group, so their contribution to the test statistic of the TIM carries less weight. Hence, power of the TIM is considerably reduced under such circumstances.
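The arithmetic in this example can be checked with a short sketch. It assumes Hardy–Weinberg genotype frequencies, as in the example; the function name is hypothetical.

```python
def expected_missing(n_subjects, minor_allele_freq, missing_rates):
    """Expected number of missing subjects per genotype under Hardy-Weinberg.

    missing_rates: {"AA": rate, "AC": rate, "CC": rate}, with A the minor allele.
    """
    p = minor_allele_freq
    q = 1.0 - p
    genotype_freq = {"AA": p * p, "AC": 2 * p * q, "CC": q * q}
    return {g: n_subjects * genotype_freq[g] * missing_rates[g]
            for g in genotype_freq}


# Excess missingness (10%) in the minor genotype AA.
minor_case = expected_missing(10_000, 0.3, {"AA": 0.10, "AC": 0.01, "CC": 0.01})
# Excess missingness (10%) in the major genotype CC.
major_case = expected_missing(10_000, 0.3, {"AA": 0.01, "AC": 0.01, "CC": 0.10})

minor_total = sum(minor_case.values())
major_total = sum(major_case.values())
```

With these inputs the minor-genotype scenario gives about 181 missing subjects in total and the major-genotype scenario about 541, roughly three times as many, which is why the same 10% excess is easier to detect when it falls on the major genotype.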

In this article, a new strategy, which is not conditional on the genotypes of affected offspring, is proposed to avoid the weakness of the TIM method. This novel method extends the expectation-maximization algorithm-based HRR (EM-HRR),¹¹ which utilized all types of ascertained data (triads, dyads and monads; note that monads are affected offspring without any parent). A previous study¹⁵ had revealed that when parental genotypes were missing informatively, inconsistent LD signals were frequently observed between the EM-HRR that included incomplete data and the HRR that used only complete triads. Since the Breslow-Day test^{16, 17} was designed to test homogeneity of multiple odds ratios, it can detect inconsistent estimates of odds ratios from the EM-HRR and HRR. Therefore, the new test of informative missingness is named the TIMBD.

Materials and methods

Following the previous work,¹⁴ let represents the observed sample size for each type of triad data. k=‘0’, ‘1’ or ‘2’ denotes the total number of B₁ alleles transmitted to the offspring, and i, j=‘0’, ‘1’ or ‘2’ denotes the total number of B₁ alleles for the father and mother, respectively. For example, represents the total number of triads where genotypes of the offspring, father and mother are B₁B₂, B₂B₂ and B₁B₂, respectively. Note that the superscript ‘*’ indicates that the parental genotype is missing. For example, represents dyads with the missing father.

Let , , and denote the total number of transmitted alleles for B₁, non-transmitted alleles for B₁, transmitted alleles for B₂ and non-transmitted alleles for B₂, respectively, where the superscript s indicates the family types (s=1 for complete triads, s=2 for dyads and s=3 for monads).

The HRR is only applicable for complete triads, where . Unlike the original EM-HRR that utilizes both complete and incomplete data, the EM-HRR statistic in this article only includes dyads and monads such that . The EM-HRR statistic uses the same proportions estimated from the original EM-HRR and detailed calculations of the HRR and EM-HRR are displayed in Table 1.

Table 1 The HRR and EM-HRR statistics for the TIMBD

Let denotes the total number of alleles obtained from complete data and denotes the total number of alleles derived from incomplete data. Assuming the absence of genetic heterogeneity, the proofs (see Appendix A for details) indicate that parental genotypes are MCAR/MAR if and only if E(HRR)=E(EM-HRR). As a result, the Breslow-Day test is implemented to detect the inequality of the HRR and EM-HRR. Here, the Mantel-Haenszel Odds Ratio is defined as:

The TIMBD is computed as:

(see Appendix B for details). Since the Breslow-Day test is available in many statistical packages, the TIMBD is not computationally intensive. Under the null hypothesis of MCAR/MAR, the TIMBD has an asymptotic χ² distribution with one degree of freedom.
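As a rough illustration of the computation, the sketch below implements the textbook Mantel-Haenszel odds ratio and the textbook Breslow-Day homogeneity statistic for two 2 × 2 tables. It is not the article's Appendix B derivation. The layout is an assumption: each table is written as (a, b, c, d) with rows for transmitted and non-transmitted alleles and columns for B₁ and B₂, one table for complete triads (HRR) and one for incomplete data (EM-HRR). All function names and counts are hypothetical.

```python
import math


def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio for 2 x 2 tables given as (a, b, c, d)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


def expected_cell(table, psi):
    """Expected top-left count when the table's odds ratio is fixed at psi."""
    a, b, c, d = table
    n1, n0, m1 = a + b, c + d, a + c  # row totals and first column total
    if math.isclose(psi, 1.0):
        return n1 * m1 / (n1 + n0)
    # Solve e * (n0 - m1 + e) = psi * (n1 - e) * (m1 - e) for e.
    qa = 1.0 - psi
    qb = n0 - m1 + psi * (n1 + m1)
    qc = -psi * n1 * m1
    root = math.sqrt(qb * qb - 4.0 * qa * qc)
    low, high = max(0.0, m1 - n0), min(n1, m1)
    for candidate in ((-qb + root) / (2.0 * qa), (-qb - root) / (2.0 * qa)):
        if low - 1e-9 <= candidate <= high + 1e-9:
            return candidate
    raise ValueError("no admissible root for the expected cell count")


def breslow_day(tables):
    """Return (common odds ratio, Breslow-Day statistic, degrees of freedom)."""
    psi = mantel_haenszel_or(tables)
    statistic = 0.0
    for table in tables:
        a, b, c, d = table
        n1, n0, m1 = a + b, c + d, a + c
        e = expected_cell(table, psi)
        variance = 1.0 / (1.0 / e + 1.0 / (n1 - e)
                          + 1.0 / (m1 - e) + 1.0 / (n0 - m1 + e))
        statistic += (a - e) ** 2 / variance
    return psi, statistic, len(tables) - 1


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical allele counts: complete triads, then incomplete data.
psi, stat, df = breslow_day([(120, 80, 100, 100), (30, 20, 10, 40)])
p_value = chi2_sf_1df(stat)
```

With two strata the statistic has one degree of freedom, matching the asymptotic distribution stated above. The counts from incomplete data would in practice be EM estimates and need not be integers, which this sketch accepts.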

It is worth noting that both the HRR and EM-HRR are robust to population stratifications, even if allele frequencies in the sub-populations are extremely different. Hence, the TIMBD, which is based on the HRR and EM-HRR, is also robust to population admixture and remains a valid test under MAR.

Simulations

To provide fair comparisons, simulation schemes similar to those of the TIM¹⁴ were adopted. Considering an SNP, simulations begin with the assumption that the population is under the Hardy–Weinberg Equilibrium. Let ‘a’ and ‘A’ denote the disease allele and normal allele, respectively. ‘D’ means that an individual is diseased or affected. Let ‘f’ denote the probability of being affected when an individual carries 0 risk alleles (the phenocopy rate), and let ‘K’ denote the genotype relative risk. For a recessive disease model, the penetrance functions are P(D∣AA)=P(D∣Aa)=f and P(D∣aa)=K × f, where 0⩽f⩽1 and 0⩽K × f⩽1. The disease prevalence is determined by these probabilities and the risk allele frequency. Similarly, for a dominant disease model, P(D∣AA)=f and P(D∣Aa)=P(D∣aa)=K × f. In addition, the confined additive model was also created as P(D∣AA)=f; P(D∣Aa)=min(K × f, 1); P(D∣aa)=min(2 × K × f, 1). The affection status of each individual was determined according to these parameters.
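The three penetrance models and the resulting disease prevalence can be written out as a short sketch. The function names are hypothetical; the prevalence calculation assumes Hardy–Weinberg genotype frequencies, as the simulations do.

```python
def penetrances(model, f, K):
    """Penetrance P(D | genotype) for the three disease models in the text.

    f: phenocopy rate, K: genotype relative risk, 'a' is the disease allele.
    """
    if model == "recessive":
        return {"AA": f, "Aa": f, "aa": K * f}
    if model == "dominant":
        return {"AA": f, "Aa": K * f, "aa": K * f}
    if model == "additive":  # the confined additive model
        return {"AA": f, "Aa": min(K * f, 1.0), "aa": min(2 * K * f, 1.0)}
    raise ValueError(f"unknown disease model: {model}")


def prevalence(model, f, K, disease_allele_freq):
    """Population prevalence under Hardy-Weinberg genotype frequencies."""
    q = disease_allele_freq
    p = 1.0 - q
    genotype_freq = {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}
    pen = penetrances(model, f, K)
    return sum(genotype_freq[g] * pen[g] for g in genotype_freq)


# Phenocopy rate 0.2 and penetrance 0.4 (K = 2), disease allele frequency 0.3.
recessive_prevalence = prevalence("recessive", 0.2, 2.0, 0.3)
dominant_prevalence = prevalence("dominant", 0.2, 2.0, 0.3)
```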

Several disease allele frequencies as well as marker allele frequencies were examined. A range of possible values for the disequilibrium coefficient δ and recombination fraction θ were simulated. The frequencies of the disease and marker alleles, the disease model, the phenocopy rate and the penetrance rate are indicated in each table. According to these parameters, a general population was simulated where nuclear families have exactly one offspring. Parental genotypes under the Hardy–Weinberg Equilibrium were first simulated. Offspring genotypes were then generated for each household according to Mendel's laws. After genotypes were simulated for every triad, the disease status of the offspring was determined by the offspring genotype, the disease penetrance rate and the phenocopy rate. The next step was to create the missing data, where the parental genotypes as well as the offspring genotypes were assigned to be absent according to various missing rates, which were clearly indicated in the tables. The last step was to randomly select probands (triads, dyads and monads) from the simulated population.
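The simulation steps above can be sketched in simplified form. This is not the article's simulation code. It makes several simplifying assumptions that are labeled in the comments: the marker is the disease locus itself (so δ and θ are not modeled), only parental genotypes can go missing, and affected families are collected by rejection sampling. All names and parameter values are hypothetical.

```python
import random


def draw_genotype(q, rng):
    """Number of risk alleles (0, 1 or 2) under Hardy-Weinberg equilibrium."""
    return (rng.random() < q) + (rng.random() < q)


def transmit(genotype, rng):
    """Allele passed on by a parent (1 = risk allele), following Mendel's laws."""
    if genotype == 1:
        return int(rng.random() < 0.5)
    return genotype // 2


def simulate_families(n_families, q, penetrance, parent_missing, rng):
    """Simulate ascertained families with genotype-dependent parental missingness.

    Assumptions of this sketch: single locus, offspring genotype always observed.
    penetrance, parent_missing: {risk_allele_count: probability}
    """
    families = []
    while len(families) < n_families:
        father, mother = draw_genotype(q, rng), draw_genotype(q, rng)
        child = transmit(father, rng) + transmit(mother, rng)
        if rng.random() >= penetrance[child]:
            continue  # unaffected offspring are not ascertained
        father_seen = rng.random() >= parent_missing[father]
        mother_seen = rng.random() >= parent_missing[mother]
        kind = ("monad", "dyad", "triad")[father_seen + mother_seen]
        families.append({"child": child, "father": father, "mother": mother,
                         "father_seen": father_seen, "mother_seen": mother_seen,
                         "kind": kind})
    return families


rng = random.Random(2012)
families = simulate_families(
    n_families=500, q=0.3,
    penetrance={0: 0.2, 1: 0.2, 2: 0.4},          # recessive model, f=0.2, K=2
    parent_missing={0: 0.01, 1: 0.01, 2: 0.10},   # informative: excess in genotype 2
    rng=rng,
)
```

Setting every entry of `parent_missing` to the same value would give an MCAR pattern for the parents in this sketch.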

In the second set of simulations, population stratifications were considered. The previous scheme¹⁴ was adopted and two populations were sampled under the Hardy–Weinberg Equilibrium with expected sample sizes reflecting different disease allele frequencies in the two populations. For example, for a pure recessive model, if the disease allele frequencies of the two populations are 0.3 and 0.6, respectively, then 9% of the first and 36% of the second population would be affected and sampled. Therefore, one would expect 20 and 80% of the sample to come from the first and second populations, respectively. This is the ratio that one would observe in most samples with admixture. Because the disease allele frequencies are different in the two populations, the frequencies of the diseased individuals in the two samples are also different. The disease allele frequencies, the marker allele frequencies, the phenocopy rates and the penetrance rates for the two populations were indicated in the tables.
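The 20%/80% split follows directly from the affected fractions, as the sketch below shows. It assumes equally sized source populations and a pure recessive model in which only homozygous carriers are affected, so the affected fraction is the squared disease allele frequency; the function name is hypothetical.

```python
def sample_shares(disease_allele_freqs):
    """Expected share of affected probands contributed by each population.

    Assumes equally sized populations and a pure recessive model, where the
    affected fraction in a population equals q squared.
    """
    affected = [q * q for q in disease_allele_freqs]
    total = sum(affected)
    return [x / total for x in affected]


shares = sample_shares([0.3, 0.6])  # affected fractions 0.09 and 0.36
```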

The simulations were repeated 10 000 (1000) times to examine the type-I error (power) of each test, including the TIMBD. In general, parents of the affected offspring are difficult to recruit. Therefore, the missing rates ranged from 1 to 40% in computer simulations. For example, in data derived from the Framingham Heart Study,¹¹ the missing rate for systolic blood pressure was as high as 91% (247/271).

In this article, power under the GWAS scenario was also examined. Assuming that one million SNPs were tested, a much larger sample size of 5000 triads was considered. In addition, the most stringent correction for multiple testing was adopted. Therefore, only P-values smaller than the Bonferroni-adjusted α (0.05/10⁶ = 5 × 10⁻⁸) could be declared significant. A total of 10 000 repetitions were done for the GWAS scenario.

In Tables 2, 3, 4, 5, 6, 7, the column marked ‘TDT’ reports results using the traditional TDT test on the subset of complete triads only. The column marked ‘1-TDT’ uses both the complete triads and dyads. The column marked ‘TIM’ is the test of informative missingness¹⁴ and the last column ‘TIMBD’ represents the new strategy proposed in this article. Allen et al.¹² commented that the original 1-TDT should not be used. Thus, the modified 1-TDT was used, but not the original 1-TDT, in computer simulations.

Table 2 Type-I error (%) of the TIMBD in a homogeneous population assuming MCAR

Table 3 Type-I error (%) of the TIMBD under population admixture with a moderate marker allele difference assuming MCAR/MAR

Table 4 Type-I error (%) of the TIMBD under population admixture with an extreme marker allele difference assuming MCAR/MAR

Table 5 Power (%) of the TIMBD assuming no linkage ( θ =0.5) or association ( δ =0)

Table 6 Power (%) of the TIMBD assuming linkage (θ=0) and association (δ=0.1)

Table 7 GWAS scenarios: Power (%) of the TIMBD after Bonferroni’s correction for multiple testing (α=5 × 10⁻⁸) assuming no linkage (θ=0.5) or association (δ=0)

Results

Type-I error

When the missing pattern was MCAR for any member of the triads, type-I errors of the TIMBD in a homogeneous population are displayed in Table 2. The disease and marker allele frequencies were 0.3 and 0.4, respectively. The disease penetrance and phenocopy rate were 0.4 and 0.2, respectively. Different disease and marker allele frequencies, penetrance rates and phenocopy rates yielded similar results, which were not shown in the tables. The underlying disease model was indicated in the first column. The second and third columns were the recombination fraction (θ) and disequilibrium coefficient (δ). The three missing rates for the father, mother and offspring were displayed, respectively, as the first, second and third number in the parentheses in the fourth column. The three missing rates may be different. However, each of the three missing rates was identical for all genotypes, B₁B₁, B₁B₂ and B₂B₂, such that the missing patterns were considered as MCAR for each family member. When there was no linkage (θ=0.5) or association (δ=0), the TDT and 1-TDT showed the expected 5% chance of rejecting the null hypothesis in the upper nine rows of Table 2. When there is linkage (θ=0) and association (δ=0.14), the power of the TDT and 1-TDT is displayed in the bottom nine rows of Table 2. Simulation results indicated that type-I errors of the TIM and TIMBD were less than the nominal level of 5%, regardless of the relationship between the marker and the disease alleles. Therefore, test statistics of the TIM and TIMBD were independent of the recombination fraction θ and disequilibrium coefficient δ.

When the parental and offspring genotypes were MCAR or MAR in an admixed population, type-I errors of the TIM and TIMBD were displayed in Tables 3 and 4. The disease penetrance and phenocopy rates were 0.4 and 0.2, respectively. This scenario implies that the genotype relative risk was 2. Higher or lower genotype relative risk yielded similar comparisons and the results were not shown. In Tables 3 and 4, the disease allele frequencies of the first and second populations were 0.2 and 0.6, respectively. Therefore, the degree of admixture was identical in Tables 3 and 4. However, in Table 3 (Table 4), the minor marker allele frequencies for the first and second populations were 0.4 and 0.3 (0.6 and 0.2), respectively. Hence, the difference between the marker allele frequencies of the two populations was more extreme in Table 4 than that in Table 3. When the disease and marker allele frequencies were <0.2 or >0.6, the comparisons between the TIM and TIMBD were similar and the results are not shown.

Since the TDT and 1-TDT were robust to population stratifications, both methods demonstrated the expected 5% type-I errors when there was no linkage or association in the upper nine rows of Tables 3 and 4. Under the alternative hypothesis, the TDT showed the lowest power due to exclusions of dyads in the analysis in the bottom nine rows of Tables 3 and 4. Therefore, if the missing pattern was MCAR/MAR, then the 1-TDT was more powerful than the TDT for detecting LD, which matched previous reports by Sun et al.⁵ and Guo et al.¹¹ Although the TIM performed well under population admixture and showed type-I errors <5% in Table 3, its type-I errors could be slightly inflated over 8% in rows 3, 6, 9, 12, 15 and 18 of Table 4. In both scenarios, type-I errors of the TIMBD did not exceed 5%, although it appeared conservative. Therefore, the simulation results revealed that the TIMBD was robust to population admixture, while the TIM may suffer slightly inflated type-I errors.

Earlier work on the Breslow-Day test¹⁸ noted that it requires large sample sizes within strata and that it yields conservative type-I errors under ‘small stratum’ settings. In the simulations, the sample size of the EM-HRR (i.e., the missing data stratum) was deliberately kept small to reflect real life scenarios, where the proportion of missing data is not too high. Therefore, the type-I error of the TIMBD was slightly conservative. Note that the average marker allele frequency in Table 3 was higher than that in Table 4. As a result, the type-I error of the TIMBD decreased from Table 3 to Table 4. In other words, decreasing marker allele frequencies introduced more conservative type-I errors of the TIMBD and this pattern matched the previous results.¹⁸ Regardless, the TIMBD did not yield an inflated type-I error and remained a valid test, even if the sample sizes in some strata were small.

Power

Simulation results displayed in Table 5 (no association (δ=0) or linkage (θ=0.5)) and in Table 6 (association (δ=0.1) and linkage (θ=0)) were circumstances under which genotypes of triads were missing informatively in a homogeneous population. The disease and marker allele frequencies were 0.3 and 0.4, respectively. The disease penetrance and phenocopy rate were 0.4 and 0.2, respectively. The following two scenarios were examined: (1) the odd rows in each disease model: informative missingness occurred solely in parents but not the offspring. One can see that the missing rate (15 or 10%) was identical for any offspring genotype; (2) the even rows in each disease model: informative missingness occurred in both offspring and parental genotypes.

In Table 5, the TDT using the subset of complete triads remained a valid test for LD under the first scenario (the odd rows), since the TDT revealed type-I errors approaching the 5% nominal level. However, the 1-TDT, which used both triads and dyads, showed inflated type-I errors and the inflation increased with the magnitude of informative missingness. Under the second scenario, the TDT and 1-TDT were no longer valid tests, but the 1-TDT was less inflated than the TDT. In either scenario, the TIMBD was consistently more powerful than the TIM and the difference was more discernible when informative missingness was introduced by the minor allele (B₁)/genotype (B₁B₁) (rows 3, 4, 7 and 8 in each disease model).

In Table 6, power of the 1-TDT was lower (higher) than the TDT, when the major (minor) genotype introduced the informative missingness. This fact suggested that including dyads in the analysis could either dampen or inflate power of the 1-TDT when the assumption of MCAR/MAR was violated, which matched the previous investigations.^{19, 20} The results revealed an important message that informative missingness could also prevent discoveries of putative disease genes.

The GWAS scenarios assuming no linkage or association were displayed in Table 7. The results were adjusted by the Bonferroni’s correction for multiple testing (the adjusted α=5 × 10⁻⁸). The TIMBD demonstrated decent power in the GWAS scenarios. The results also revealed that the TDT could yield considerable false positives (the second row in each disease model), even if the correction for multiple testing was implemented. This phenomenon illustrated the relationship between excessive false positives and informative missingness in the GWAS analysis.

Discussion

Unlike the TIM, which is conditional on the offspring genotypes, the novel strategy TIMBD detects informative missingness by inconsistent LD signals between the complete and incomplete data. Owing to its family-based design, the TIMBD is robust to population stratifications and outperforms the TIM in most situations. Excessive false positives due solely to informative missingness were also observed in the GWAS scenarios. The TIMBD is applicable to general pedigrees, when independent triads, dyads and monads are identified from the independent pedigrees (see Supplementary data for the application in SAS/STAT software, SAS Institute Inc., Cary, NC, USA).

In addition to non-random genotyping failure, which introduces informative missingness in both the offspring and parents, informative missingness may occur due to death or refusal to participate related to the outcome. One example to consider is asthma,^{21, 22} which could be diagnosed in both children and adults. The other plausible scenario is informative missingness in the parents, but not the offspring, as seen in age-dependent diseases, such as cancer,²³ Parkinson’s disease,²⁴ diabetes²⁵ and cardiovascular diseases.^{26, 27} Like the TIM, the TIMBD has the limitation that it cannot detect informative missingness that exists solely in the offspring, but not the parents, which could be classified as ascertainment bias. However, the TIMBD could serve as a foundation and/or stepping stone for considering ascertainment bias in genetic studies, since it can determine whether or not parental genotypes are missing informatively.

It is worth noting that the HRR and EM-HRR are based on 2 × 2 contingency tables; hence, the TIMBD could be easily extended into the logistic regression framework to adopt the Breslow-Day test in the logit model. In this way, the TIMBD could adjust for covariates related to missingness and ensure a valid test under various conditions of MAR as discussed previously.¹⁴

References

Falk, C. T. & Rubinstein, P. Haplotype relative risks: an easy reliable way to construct a proper control sample for risk calculations. Ann. Hum. Genet. 51, 227–233 (1987).
Article CAS Google Scholar
Spielman, R. S., McGinnis, R. E. & Ewens, W. J. Transmission test for linkage disequilibrium: the sinsulin gene region and insulin dependent diabetes mellitus. Am. J. Hum. Genet. 52, 506–516 (1993).
CAS PubMed PubMed Central Google Scholar
Little, R. J. A. & Rubin, D. B. Statistical Analysis with Missing Data. 2nd edn. (Wiley, New York, 2002).
Book Google Scholar
Curtis, D. R. & Sham, P. C. A note on the application of the transmission disequilibrium test when a parent is missing. Am. J. Hum. Genet. 56, 811–812 (1995).
CAS PubMed PubMed Central Google Scholar
Sun, F., Flanders, W., Yang, Q. & Khoury, J. Transmission Disequilibrium Test (TDT) with only one parent is available: The 1-TDT. Am. J. Epidemiol. 150, 97–104 (1999).
Article CAS Google Scholar
Rabinowitz, D. & Laird, N. M. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum. Hered. 504, 227–233 (2000).
Google Scholar
Goring, H. H. & Terwilliger, J. D. Linkage analysis in the presence of errors IV: joint pseudomarker analysis of linkage and/or linkage disequilibrium on a mixture of pedigrees and singletons when the mode of inheritance cannot be accurately specified. Am. J. Hum. Genet. 66, 1310–1327 (2000).
Article CAS Google Scholar
Clayton, D. A generalization of the transmission/disequilibrium test for uncertain haplotype transmission. Am. J. Hum. Genet. 65, 1170–1177 (1999).
Article CAS Google Scholar
Weinberg, C. R. Allowing for missing parents in genetic studies of case-parents triads. Am. J. Hum. Genet. 64, 1186–1193 (1999).
Article CAS Google Scholar
Gordon, D., Haynes, C., Johnnidis, C., Patel, S. B., Bowcock, A. M. & Ott, J. A. transmission disequilibrium test for general pedigrees that is robust to the presence of random genotyping errors and any number of untyped parents. Eur. J. Hum. Genet. 12, 752–761 (2004).
Article CAS Google Scholar
Guo, C. Y., Destefano, A. L., Lunetta, K. L., Dupuis, J. & Cupples, L. A. Expectation Maximization Algorithm Based Haplotype Relative Risk (EM-HRR): test of linkage disequilibrium using incomplete case-parents trios. Hum. Hered. 59, 125–135 (2005).
Article Google Scholar
Allen, A. S., Rathouz, P. J. & Satten, G. A. Informative missingness in genetic association studies: case-parent designs. Am. J. Hum. Genet. 72, 671–680 (2003).
Article CAS Google Scholar
Chen, Y. H. New approach to association testing in case-parent designs under informative parental missingness. Genet. Epidemiol. 27, 131–140 (2004).
Article Google Scholar
Guo, C. Y., Cupples, L. A. & Yang, Q. Testing informative missingness in genetic studies using case–parent triads. Eur. J. Hum. Genet. 16, 992–1001 (2008).
Article CAS Google Scholar
Guo, C. Y. The impact of complex informative missingness on the validity of the transmission/disequilibrium test (TDT). BMC Proc. 1 (Suppl 1), S26 (2007).
Article Google Scholar
Breslow, N. E. & Day, N. E. Statistical Methods in Cancer Research, Volume II: The Design and Analysis of Cohort Studies (Oxford University Press, USA, 1994).
Google Scholar
Breslow, N. E. & Day, N. E. Statistical Methods in Cancer Research, Volume I: The Analysis of Case-Control Studies (Oxford University Press, Inc., New York, 1993).
Google Scholar
Paul, S. R. & Donner, A. Small sample performance of tests of homogeneity of odds ratios in K 2 × 2 table. Stat. Med. 11, 159–165 (1992).
Article CAS Google Scholar
Guo, C. Y., Cui, J. & Cupples, L. A. Impact of non-ignorable missingness on genetic tests of linkage and/or association using case-parents trios. BMC Genet. 6 (Suppl 1), S90 (2005).
Article Google Scholar
Guo, C. Y. Validity of the transmission/disequilibrium test (TDT) under impact of complex informative missingness. BMC Proc. 1 (Suppl 1), S26 (2007).
Article Google Scholar
Weiss, S. T. & Silverman, E. K. Pro: Genome-Wide Association Studies (GWAS) in Asthma. Am. J. Respir. Crit. Care Med. 184, 631–633 (2011).
Article Google Scholar
Adcock, I. M. & Barnes, P. J. Con: genome-wide association studies have not been useful in understanding asthma. Am. J. Respir. Crit. Care Med. 184, 633–636 (2011).
Article Google Scholar
Barrett, J. H., Iles, M. M., Harland, M., Taylor, J. C., Aitken, J. F., Andresen, P. A. et al. Genome-wide association study identifies three new melanoma susceptibility loci. Nat. Genet. 43, 1108–1113 (2011).
Article CAS Google Scholar
Simón-Sánchez, J., van Hilten, J. J., van de Warrenburg, B., Post, B., Berendse, H. W., Arepalli, S. et al. Genome-wide association study confirms extant PD risk loci among the Dutch. Eur. J. Hum. Genet. 19, 655–661 (2011).
Article Google Scholar
Kooner, J. S., Saleheen, D., Sim, X., Sehmi, J., Zhang, W., Frossard, P. et al. Genome-wide association study in individuals of South Asian ancestry identifies six new type 2 diabetes susceptibility loci. Nat. Genet. 43, 984–989 (2011).
Article CAS Google Scholar
Newton-Cheh, C., Guo, C. Y., Wang, T. J., O’donnell, C. J., Levy, D. & Larson, M. G. Genome-wide association study of electrocardiographic and heart rate variability traits: the Framingham Heart Study. BMC Med. Genet. 8, S7 (2007).
Article Google Scholar
Cupples, L. A., Arruda, H. T., Benjamin, E. J., D’Agostino, R. B., Demissie, S., DeStefano, A. L. et al. The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports. BMC Med. Genet. 8, S1 (2007).
Paul, S. R. & Donner, A. A comparison of tests of homogeneity of odds ratios in K 2 × 2 tables. Stat. Med. 8, 1455–1468 (1989).


Acknowledgements

This work was supported by research grant ‘NSC-99-2314-B-006-053’ awarded by the National Science Council, Taiwan. It was also partially supported by a grant from the Ministry of Education, Aim for the Top University Plan. I greatly appreciate the insightful and valuable comments of the two reviewers, which substantially improved this article.

Author information

Authors and Affiliations

Division of Biostatistics, Institute of Public Health, National Yang Ming University, Taipei, Taiwan, ROC
Chao-Yu Guo
Head and Neck Cancer Research Program, Cancer Research Center, National Yang Ming University, Taipei, Taiwan, ROC
Chao-Yu Guo
Genome Research Center, National Yang Ming University, Taipei, Taiwan, ROC
Chao-Yu Guo

Corresponding author

Correspondence to Chao-Yu Guo.

Additional information

Supplementary Information accompanies the paper on Journal of Human Genetics website

Supplementary information

Supplementary Data 1 (TXT 11 kb)

Supplementary Data 2 (DOC 43 kb)

Supplementary Data 3 (CSV 21 kb)

Appendices

Appendix A

Let represent the theoretical probability for each type of triad data. Here k = 0, 1 or 2 denotes the number of B₁ alleles transmitted to the offspring, and i, j = 0, 1 or 2 denote the numbers of B₁ alleles carried by the father and mother, respectively.

Assuming the absence of heterogeneity, when parental genotypes were incomplete, Guo et al.¹¹ applied the EM algorithm to estimate the proportion of heterozygous parents who transmitted the B₁ allele but not the B₂ allele, which was denoted by . Similarly, the proportion of heterozygous parents who transmitted the B₂ allele but not the B₁ allele was denoted by . The EM-HRR of Guo et al.¹¹ avoids the biased results that Curtis and Sham⁴ warned about. An important feature of the HRR is that the genotypes of affected offspring are always present (assuming no genotyping failure) because of the ascertainment criterion, which collects an affected individual first and then seeks his/her parents. Note that only and are involved in the EM estimates. Because transmitted alleles can be inferred unambiguously, they do not require EM estimates. In addition, both and are defined as 0, because monads contain no parental genotypes from which to infer which alleles were not transmitted.

Assuming the absence of heterogeneity, under the null hypothesis of MCAR/MAR, one would expect that the EM estimates of the non-transmitted allele from the one heterozygous parent will be unbiased. Therefore, the following four equations (1, 2, 3, 4) will hold.

In this case, the EM-HRR is expected to yield the same LD signal (i.e., odds ratio (OR)) as the HRR. When there is no linkage (i.e., recombination fraction θ=0.5) or no association (i.e., disequilibrium coefficient δ=0), E(HRR)=E(EM-HRR)=1. If there is both linkage (θ≠0.5) and association (δ≠0), then E(HRR)=E(EM-HRR)≠1.

Under the alternative hypothesis of informative missingness, equations (1, 2, 3, 4) may be violated. As a result, expectations of the HRR and EM-HRR are dissimilar, E(HRR)≠E(EM-HRR), regardless of the LD information.
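The comparison of LD signals described above can be illustrated with a minimal sketch. It assumes that each signal is summarized as the odds ratio of a 2 × 2 table of transmitted versus non-transmitted B₁ and B₂ parental alleles; the function name `allele_odds_ratio` and all counts are hypothetical and are not taken from the article. For incomplete data the counts may be fractional, because they would come from EM estimates.

```python
def allele_odds_ratio(t_b1, nt_b1, t_b2, nt_b2):
    """Odds ratio of carrying B1 among transmitted vs non-transmitted alleles.

    t_b1, t_b2: counts of transmitted B1 and B2 alleles.
    nt_b1, nt_b2: counts of non-transmitted B1 and B2 alleles.
    Counts may be fractional when they are EM estimates.
    """
    return (t_b1 * nt_b2) / (t_b2 * nt_b1)


# Hypothetical counts. Under MCAR/MAR the two signals are expected to agree.
or_complete = allele_odds_ratio(120, 80, 80, 120)    # complete triads
or_incomplete = allele_odds_ratio(60, 40, 40, 60)    # dyads and monads (EM counts)

# With no linkage or no association, transmitted and non-transmitted
# alleles have the same distribution, so the odds ratio is 1.
or_null = allele_odds_ratio(100, 100, 100, 100)
```

Under informative missingness, the two odds ratios would be expected to drift apart, which is the discrepancy the homogeneity test is designed to detect.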

Following the notation in Table 1 and the Appendix of Guo et al.,¹⁴ let , , and denote the conditional probabilities that a parent transmits the B₁ allele, does not transmit the B₁ allele, transmits the B₂ allele and does not transmit the B₂ allele, respectively, in type s families, where s=1 denotes complete triads, s=2 dyads and s=3 monads.

Details of the conditional probability of transmitting and non-transmitting a specific marker allele for all three types of families are displayed in the following:

The HRR using complete triads is defined as and the EM-HRR using dyads and monads is defined as . It is straightforward to show that, under the null hypothesis of MCAR/MAR (P_o11=P_o12=P_o22=P_o, P_f11=P_f12=P_f22=P_f, and P_m11=P_m12=P_m22=P_m), HRR is identical to EM-HRR regardless of linkage (θ) and association information (δ).

If E(HRR)=E(EM-HRR), then the EM estimators are unbiased, so one would expect the offspring and parental genotypes to be MCAR/MAR. This is because the underlying distribution and parameters are then the same in the complete triads and the incomplete data, which forces the missing rates for different genotypes to be identical. Therefore, under the null hypothesis, the probability that the Breslow-Day test shows a significant difference between the HRR and EM-HRR is expected to equal the predetermined significance level.

Appendix B

Note that E₁ and E₂ are the expected values of and given OR_MH, and V₁ and V₂ are estimators of the variance of and given the value of OR_MH and conditional on the value of and (see Paul and Donner²⁸ for details). E₁ can be found by solving the following quadratic equation:

E₁ is the unique root of equation (B.1) in the interval .

V₁ is then obtained as

Similarly, E₂ can be found by solving the following quadratic equation:

E₂ is the unique root of equation (B.2) in the interval . V₂ is then obtained as

Equations (B.1 and B.2) involve finding the root of two quadratic equations, followed by the usual summation to obtain the test statistic.
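The two-step computation above can be sketched in code. The following is a minimal illustration of the textbook Breslow-Day homogeneity test for two 2 × 2 tables, not a reproduction of the article's equations (B.1) and (B.2), whose exact notation is not shown here. The table layout ((a, b), (c, d)), the function names and the use of the top-left cell are assumptions made for the example. The p-value uses the chi-square distribution with one degree of freedom, which applies because exactly two tables are compared.

```python
import math


def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio for 2x2 tables ((a, b), (c, d))."""
    num = sum(a * d / (a + b + c + d) for (a, b), (c, d) in tables)
    den = sum(b * c / (a + b + c + d) for (a, b), (c, d) in tables)
    return num / den


def expected_and_variance(table, psi):
    """Expected value and variance of cell a given odds ratio psi and fixed margins."""
    (a, b), (c, d) = table
    r1, c1, n = a + b, a + c, a + b + c + d
    lower, upper = max(0.0, r1 + c1 - n), min(r1, c1)
    # E solves E * (n - r1 - c1 + E) = psi * (r1 - E) * (c1 - E),
    # i.e. the quadratic qa * E^2 + qb * E + qc = 0.
    qa = 1.0 - psi
    qb = n - r1 - c1 + psi * (r1 + c1)
    qc = -psi * r1 * c1
    if abs(qa) < 1e-12:
        e = -qc / qb  # psi == 1: the equation is linear
    else:
        disc = math.sqrt(qb * qb - 4.0 * qa * qc)
        roots = ((-qb + disc) / (2.0 * qa), (-qb - disc) / (2.0 * qa))
        # Keep the root that lies in the admissible interval.
        e = next(r for r in roots if lower - 1e-9 <= r <= upper + 1e-9)
    v = 1.0 / (1.0 / e + 1.0 / (r1 - e) + 1.0 / (c1 - e)
               + 1.0 / (n - r1 - c1 + e))
    return e, v


def breslow_day_two_tables(table1, table2):
    """Breslow-Day statistic and p-value (1 degree of freedom) for two tables."""
    tables = (table1, table2)
    psi = mantel_haenszel_or(tables)
    stat = 0.0
    for table in tables:
        e, v = expected_and_variance(table, psi)
        stat += (table[0][0] - e) ** 2 / v
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For example, `breslow_day_two_tables(((20, 10), (10, 20)), ((10, 20), (20, 10)))` compares two hypothetical tables with opposite odds ratios and yields a large statistic, whereas two identical tables yield a statistic of zero. The sketch does not handle tables with zero cells, where the variance term is undefined.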

About this article

Cite this article

Guo, CY. A novel test of informative missingness using inconsistent linkage disequilibrium signals between case-parent triads and incomplete data. J Hum Genet 57, 601–609 (2012). https://doi.org/10.1038/jhg.2012.78


Received: 16 March 2012
Revised: 26 May 2012
Accepted: 28 May 2012
Published: 28 June 2012
Issue Date: September 2012
DOI: https://doi.org/10.1038/jhg.2012.78