Introduction

Spurious associations due to population admixture could be a serious issue in genetic studies using unrelated subjects. To avoid false signals, the family-based approach, haplotype relative risk (HRR),1 utilizes case-parent triads to detect linkage disequilibrium (LD) between a marker and a putative disease locus by comparing parental marker alleles transmitted to an affected offspring with those not transmitted. Instead of treating transmitted and non-transmitted alleles as unrelated, the Transmission/Disequilibrium Test (TDT)2 treats case-parent triads as matched data and examines whether heterozygous parents preferentially transmit a specific allele to the affected offspring. The TDT is more powerful than the HRR, especially when population admixture is present. Therefore, the TDT is a popular study design for early onset diseases.

The greatest challenge in recruiting case-parent triads is that one or both parental genotypes may be unavailable due to declined participation, death, or other unexpected reasons. In statistical analysis, both missing completely at random (MCAR) and missing at random (MAR) mechanisms are ignorable.3 If the events that lead to any particular value being missing are independent of both observed and unobserved parameters of interest, then the missing pattern is considered MCAR. If, given the observed data, the missing mechanism does not depend on the unobserved data, then the missing pattern is MAR. The distinction between MCAR and MAR can be confusing in the setting of a genetic study. For a single nucleotide polymorphism (SNP) with alleles A and C, there are three genotypes: AA, AC and CC. In an admixed population, if the missing rates of the three genotypes are identical in all sub-populations, then the missing pattern is MCAR. If the missing rates of the three genotypes are identical within each sub-population but differ across sub-populations, then the missing pattern appears to be MAR. Two distinct types of missingness in genotype data should be noted, because they arise from different mechanisms. In the first, individuals are unavailable due to death or non-participation. Therefore, the missing rates for the offspring and their parents can differ. As a result, informative missingness could occur solely in the parents, but not the offspring. In the second, the genotyping assay fails to deliver a ‘call’ at a particular locus for a particular specimen, even though the person participated. Such failure may depend on the true genotype (hence be informative), but may not differ across individuals. As a result, informative missingness would exist in both the offspring and parents.

In 1995, the estimated probability of transmission of certain alleles4 was pointed out to be biased in the TDT using dyads (the affected offspring with only one parent), where only heterozygous parents and homozygous offspring contributed to the test. The 1-TDT5 was free from such bias when parental genotypes were MCAR or MAR. In addition to the 1-TDT, the family-based association test by Rabinowitz and Laird6 as well as several other strategies had been proposed7, 8, 9, 10, 11 to accommodate incomplete triads. When parental genotypes were missing informatively, Allen et al.12 and Chen13 carried out valid tests to incorporate incomplete data. However, the two methods experienced substantially reduced statistical power when the underlying missing pattern was truly MCAR/MAR as discussed by Guo et al.14 Although various scenarios had been well studied, all methods5, 6, 7, 8, 9, 10, 11, 12, 13, 14 focused on the missing pattern of parental genotypes and assumed that the offspring genotypes were MCAR/MAR. When the assumption of MCAR/MAR was violated among offspring genotypes, Guo15 indicated that the TDT using only complete triads may still inflate the type-I error and/or reduce power due to ascertainment bias. This phenomenon suggests that if the missing pattern of offspring genotypes is not determined, a significant result of the TDT may not assure a true association, even if incomplete triads are excluded from the analysis. Therefore, missing mechanism is an important issue in analyzing genetic data.

The first attempt to determine whether or not parental genotypes are missing informatively was introduced by Guo et al.,14 the Test of Informative Missingness (TIM), which compared the distribution of parental genotypes in triads with that of dyads, conditional on the genotypes of affected offspring. Differential distributions of parental genotypes in triads and dyads indicated that the missing pattern of parental genotypes was not ignorable. The TIM is a valuable tool for genetic data cleaning. A novel application for the TIM was to exclude SNPs that are missing informatively in a genome-wide association study (GWAS). In this way, fewer yet more reliable SNPs will be analyzed and this procedure may effectively reduce the excessive amount of false positives in the analysis. In the era of GWAS, one million SNPs are considered as the standard. SNPs with missing rates exceeding a specific threshold are now routinely excluded, because inclusions of SNPs with higher missing rates lead to too many significant results, which are thought to be false positives. This strict enforcement of nearly complete data raises some important issues, since excessive rates of significant results in a typical GWAS may be potentially caused by informative missingness.

Although the TIM demonstrates decent power, its performance is discernibly weaker when the minor allele/genotype introduces informative missingness. The following example illustrates why. Assume that the minor (major) allele is A (C) and the corresponding allele frequency is 0.3 (0.7), so that the frequencies of genotypes AA, AC and CC are 0.09, 0.42 and 0.49, respectively. If 10% of the subjects with the genotype AA are missing, but only 1% of the individuals with the genotype AC or CC are not available, then the missing pattern is informative. In a random sample of 10 000 subjects, one would expect 90, 42 and 49 subjects with genotypes AA, AC and CC, respectively, to be missing. In contrast, if the excessive missingness (10%) occurs in the major genotype CC, but only 1% of the individuals with the genotype AA or AC are absent, then one would expect 9, 42 and 490 subjects to be missing, a much larger number of individuals with missing genotypes, which results in a stronger signal for informative missingness. Since the TIM is conditional on the offspring genotypes, the size of each of the three offspring genotype groups has an important role when comparing the distribution of parental genotypes in triads with that of dyads. It is worth noting that offspring with the genotype AA form the smallest group, and their contribution to the test statistic of the TIM carries less weight. Hence, power of the TIM is considerably reduced under such circumstances.
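The arithmetic of this example can be checked with a short sketch. The function name and argument layout are illustrative assumptions; genotype frequencies are taken to follow Hardy–Weinberg proportions, which reproduces the 0.09, 0.42 and 0.49 used in the text.

```python
def expected_missing(n_subjects, minor_freq, miss_rates):
    """Expected number of missing subjects per genotype.

    Assumes Hardy-Weinberg proportions for a SNP with minor allele A
    (frequency minor_freq) and major allele C. miss_rates maps each
    genotype to its genotype-specific missing rate.
    """
    p, q = minor_freq, 1.0 - minor_freq
    geno_freq = {"AA": p * p, "AC": 2.0 * p * q, "CC": q * q}
    return {g: n_subjects * geno_freq[g] * miss_rates[g] for g in geno_freq}


# Excess missingness in the minor genotype AA (10% vs 1%).
minor_excess = expected_missing(10_000, 0.3, {"AA": 0.10, "AC": 0.01, "CC": 0.01})

# Excess missingness in the major genotype CC (10% vs 1%).
major_excess = expected_missing(10_000, 0.3, {"AA": 0.01, "AC": 0.01, "CC": 0.10})

for label, counts in (("minor", minor_excess), ("major", major_excess)):
    print(label, {g: round(v) for g, v in counts.items()}, round(sum(counts.values())))
```

The first scenario yields about 181 missing subjects in total and the second about 541, which is why excess missingness in the major genotype produces the stronger signal.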

In this article, a new strategy, which is not conditional on the genotypes of affected offspring, is proposed to avoid the weakness of the TIM method. This novel method extends the expectation-maximization algorithm-based HRR (EM-HRR),11 which utilized all types of ascertained data (triads, dyads and monads; note that monads are affected offspring without any parent). A previous study15 had revealed that when parental genotypes were missing informatively, inconsistent LD signals were frequently observed between the EM-HRR that included incomplete data and the HRR that used only complete triads. Since the Breslow-Day test16, 17 was designed to test homogeneity of multiple odds ratios, it can be used to detect inconsistent estimates of odds ratios from the EM-HRR and HRR. Therefore, the new test of informative missingness is named the TIMBD.

Materials and methods

Following the previous work,14 let represents the observed sample size for each type of triad data. k=‘0’, ‘1’ or ‘2’ denotes the total number of B1 alleles transmitted to the offspring, and i, j=‘0’, ‘1’ or ‘2’ denotes the total number of B1 alleles for the father and mother, respectively. For example, represents the total number of triads where genotypes of the offspring, father and mother are B1B2, B2B2 and B1B2, respectively. Note that the superscript ‘*’ indicates that the parental genotype is missing. For example, represents dyads with the missing father.

Let , , and denote the total number of transmitted alleles for B1, non-transmitted alleles for B1, transmitted alleles for B2 and non-transmitted alleles for B2, respectively, where the superscript s indicates the family types (s=1 for complete triads, s=2 for dyads and s=3 for monads).

The HRR is only applicable for complete triads, where . Unlike the original EM-HRR that utilizes both complete and incomplete data, the EM-HRR statistic in this article only includes dyads and monads such that . The EM-HRR statistic uses the same proportions estimated from the original EM-HRR and detailed calculations of the HRR and EM-HRR are displayed in Table 1.

Table 1 The HRR and EM-HRR statistics for the TIMBD

Let denotes the total number of alleles obtained from complete data and denotes the total number of alleles derived from incomplete data. Assuming the absence of genetic heterogeneity, the proofs (see Appendix A for details) indicate that parental genotypes are MCAR/MAR if and only if E(HRR)=E(EM-HRR). As a result, the Breslow-Day test is implemented to detect the inequality of the HRR and EM-HRR. Here, the Mantel-Haenszel Odds Ratio is defined as:

The TIMBD is computed as:

(see Appendix B for details). Since the Breslow-Day test is available in many statistical packages, the TIMBD is not computationally intensive. Under the null hypothesis of MCAR/MAR, the TIMBD has an asymptotic χ² distribution with one degree of freedom.
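For readers without access to a statistical package, the generic Breslow-Day homogeneity statistic for two 2 × 2 tables can be sketched in a few lines. This is a textbook form without the Tarone correction, not a reproduction of the derivation in Appendix B. The table layout (rows: transmitted/non-transmitted; columns: B1/B2), the function names and the allele counts are all hypothetical.

```python
import math


def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio over 2x2 tables ((a, b), (c, d))."""
    num = sum(a * d / (a + b + c + d) for (a, b), (c, d) in tables)
    den = sum(b * c / (a + b + c + d) for (a, b), (c, d) in tables)
    return num / den


def expected_cell(table, psi):
    """Expected count of cell a given the table margins and odds ratio psi."""
    (a, b), (c, d) = table
    n1, n2, m1 = a + b, c + d, a + c
    if abs(psi - 1.0) < 1e-12:
        return n1 * m1 / (n1 + n2)
    # Solve (1 - psi) A^2 + (n2 - m1 + psi (n1 + m1)) A - psi n1 m1 = 0.
    qa, qb, qc = 1.0 - psi, n2 - m1 + psi * (n1 + m1), -psi * n1 * m1
    root = math.sqrt(qb * qb - 4.0 * qa * qc)
    low, high = max(0.0, m1 - n2), min(n1, m1)
    for cand in ((-qb + root) / (2.0 * qa), (-qb - root) / (2.0 * qa)):
        if low - 1e-9 <= cand <= high + 1e-9:
            return cand
    raise ValueError("no admissible root for the expected cell count")


def breslow_day(tables):
    """Return (MH odds ratio, Breslow-Day statistic) for a list of 2x2 tables."""
    psi = mantel_haenszel_or(tables)
    stat = 0.0
    for table in tables:
        (a, b), (c, d) = table
        n1, n2, m1 = a + b, c + d, a + c
        e = expected_cell(table, psi)
        var = 1.0 / (1.0 / e + 1.0 / (n1 - e) + 1.0 / (m1 - e) + 1.0 / (n2 - m1 + e))
        stat += (a - e) ** 2 / var
    return psi, stat


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical allele counts: stratum 1 = complete triads (HRR),
# stratum 2 = incomplete families (EM-HRR).
complete = ((120, 80), (100, 100))
incomplete = ((40, 60), (50, 50))
psi, stat = breslow_day([complete, incomplete])
print(f"MH OR = {psi:.3f}, statistic = {stat:.3f}, P = {chi2_sf_1df(stat):.4f}")
```

With two strata the statistic is referred to a χ² distribution with one degree of freedom, which matches the degrees of freedom stated for the TIMBD.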

It is worth noting that both the HRR and EM-HRR are robust to population stratifications, even if allele frequencies in the sub-populations are extremely different. Hence, the TIMBD, which is based on the HRR and EM-HRR, is also robust to population admixture and remains a valid test under MAR.

Simulations

To provide fair comparisons, simulation schemes similar to those of the TIM14 were adopted. For a given SNP, simulations begin with the assumption that the population is under the Hardy–Weinberg Equilibrium. Let ‘a’ and ‘A’ denote the disease allele and normal allele, respectively, and let ‘D’ denote that an individual is diseased or affected. Let ‘f’ denote the probability of being affected when an individual carries 0 risk alleles (the phenocopy rate), and let ‘K’ denote the genotype relative risk. For a recessive disease model, the penetrance functions are P(D∣AA)=P(D∣Aa)=f and P(D∣aa)=K × f, where 0⩽f⩽1 and 0⩽K × f⩽1. The disease prevalence is determined by these probabilities and the risk allele frequency. Similarly, for a dominant disease model, P(D∣AA)=f and P(D∣Aa)=P(D∣aa)=K × f. In addition, a confined additive model was defined as P(D∣AA)=f; P(D∣Aa)=min(K × f, 1); P(D∣aa)=min(2 × K × f, 1). The affection status of each individual was determined according to these parameters.
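The three penetrance models, and the prevalence they imply under the Hardy–Weinberg Equilibrium, can be written down directly. The sketch below is illustrative: the function names and the model labels are assumptions, and the example values (f=0.2, K=2, risk allele frequency 0.3) correspond to the phenocopy rate of 0.2, penetrance of 0.4 and disease allele frequency of 0.3 reported later for the homogeneous-population tables.

```python
def penetrance(model, n_risk, f, K):
    """P(D | genotype), where n_risk is the number of 'a' alleles (0, 1 or 2).

    f is the phenocopy rate and K the genotype relative risk.
    """
    if model == "recessive":
        return K * f if n_risk == 2 else f
    if model == "dominant":
        return K * f if n_risk >= 1 else f
    if model == "additive":  # the confined additive model
        return (f, min(K * f, 1.0), min(2.0 * K * f, 1.0))[n_risk]
    raise ValueError(f"unknown disease model: {model}")


def prevalence(model, risk_freq, f, K):
    """Population prevalence under Hardy-Weinberg genotype proportions."""
    q = risk_freq
    geno_freq = ((1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2)  # AA, Aa, aa
    return sum(g * penetrance(model, n, f, K) for n, g in enumerate(geno_freq))


for model in ("recessive", "dominant", "additive"):
    print(model, round(prevalence(model, 0.3, f=0.2, K=2.0), 3))
```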

Several disease allele frequencies as well as marker allele frequencies were examined. A range of possible values for the disequilibrium coefficient δ and recombination fraction θ was simulated. The frequencies of the disease and marker alleles, the disease model, the phenocopy rate and the penetrance rate are indicated in each table. According to these parameters, a general population was simulated in which nuclear families have exactly one offspring. Parental genotypes under the Hardy–Weinberg Equilibrium were simulated first. Offspring genotypes were then generated for each household according to Mendelian inheritance. After genotypes were simulated for every triad, the disease status of the offspring was determined by the offspring genotype, the disease penetrance rate and the phenocopy rate. The next step was to create the missing data: the parental genotypes as well as the offspring genotypes were set to missing according to various missing rates, which are indicated in the tables. The last step was to randomly select probands (triads, dyads and monads) from the simulated population.
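The sequence of steps can be illustrated with a deliberately simplified sketch. It makes several assumptions that the article does not: a single locus stands in for both the marker and the disease locus (so δ and θ are not modelled), families whose affected offspring has a missing genotype are simply dropped, and every function and parameter name is hypothetical.

```python
import random


def draw_genotype(rng, q):
    """Number of risk alleles (0, 1 or 2) under Hardy-Weinberg proportions."""
    return (rng.random() < q) + (rng.random() < q)


def transmit(rng, genotype):
    """Mendelian transmission: 1 if the parent passes on a risk allele."""
    if genotype == 1:
        return int(rng.random() < 0.5)
    return genotype // 2  # homozygotes always transmit the same allele


def simulate_families(n_families, q, penetrance, parent_miss, child_miss, seed=0):
    """Count ascertained triads, dyads and monads.

    penetrance, parent_miss and child_miss are indexed by the number of
    risk alleles, so genotype-specific (informative) missingness is obtained
    by giving the three genotypes different missing rates.
    """
    rng = random.Random(seed)
    counts = {"triad": 0, "dyad": 0, "monad": 0}
    for _ in range(n_families):
        father, mother = draw_genotype(rng, q), draw_genotype(rng, q)
        child = transmit(rng, father) + transmit(rng, mother)
        if rng.random() >= penetrance[child]:
            continue  # offspring unaffected: family is not ascertained
        if rng.random() < child_miss[child]:
            continue  # assumption: no usable proband genotype, family dropped
        observed = sum(rng.random() >= parent_miss[g] for g in (father, mother))
        counts[("monad", "dyad", "triad")[observed]] += 1
    return counts


# Recessive model (f=0.2, K=2); parents with two risk alleles go missing more often.
counts = simulate_families(
    n_families=20_000,
    q=0.3,
    penetrance=(0.2, 0.2, 0.4),
    parent_miss=(0.10, 0.10, 0.30),
    child_miss=(0.05, 0.05, 0.05),
    seed=1,
)
print(counts)
```

Setting all three entries of `parent_miss` to the same value gives an MCAR pattern for the parents under this sketch.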

In the second set of simulations, population stratification was considered. The previous scheme14 was adopted and two populations were sampled under the Hardy–Weinberg Equilibrium with expected sample sizes reflecting different disease allele frequencies in the two populations. For example, for a pure recessive model, if the disease allele frequencies of the two populations are 0.3 and 0.6, respectively, then 9% of the first and 36% of the second population would be affected and sampled. Therefore, one would expect 20 and 80% of the sample to come from the first and second populations, respectively. This is the ratio that one would observe in most samples with admixture. Because the disease allele frequencies are different in the two populations, the frequencies of the diseased individuals in the two samples are also different. The disease allele frequencies, the marker allele frequencies, the phenocopy rates and the penetrance rates for the two populations are indicated in the tables.
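The 20%/80% split in this example follows from the squared allele frequencies. The sketch below assumes, as the example implies, that only individuals homozygous for the disease allele are affected and that the two source populations are equally large; the function name and the optional size argument are illustrative.

```python
def sample_shares(risk_allele_freqs, population_sizes=None):
    """Share of the affected sample contributed by each population.

    Assumes a pure recessive model in which exactly the aa homozygotes
    (frequency q^2 under Hardy-Weinberg) are affected and sampled.
    """
    if population_sizes is None:
        population_sizes = [1.0] * len(risk_allele_freqs)  # equal sizes assumed
    affected = [n * q * q for q, n in zip(risk_allele_freqs, population_sizes)]
    total = sum(affected)
    return [x / total for x in affected]


shares = sample_shares([0.3, 0.6])  # 0.09 and 0.36 affected
print([round(s, 2) for s in shares])
```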

The simulations were repeated 10 000 (1000) times to examine the type-I error (power) of several tests, including the TIMBD. In general, parents of the affected offspring are difficult to recruit. Therefore, the missing rates ranged from 1 to 40% in computer simulations. High missing rates do occur in practice: in data derived from the Framingham Heart Study,11 the missing rate for systolic blood pressure was as high as 91% (247/271).

In this article, power under the GWAS scenario was also examined. Assuming that one million SNPs were tested, a much larger sample size of 5000 triads was considered. In addition, the most stringent correction for multiple testing was adopted: only P-values smaller than the Bonferroni-adjusted α (0.05/10⁶ = 5 × 10⁻⁸) were declared significant. A total of 10 000 repetitions were done for the GWAS scenario.

In Tables 2–7, the column marked ‘TDT’ reports results using the traditional TDT on the subset of complete triads only. The column marked ‘1-TDT’ uses both the complete triads and dyads. The column marked ‘TIM’ is the test of informative missingness14 and the last column ‘TIMBD’ represents the new strategy proposed in this article. Allen et al.12 commented that the original 1-TDT should not be used. Thus, the modified 1-TDT, rather than the original 1-TDT, was used in computer simulations.

Table 2 Type-I error (%) of the TIMBD in a homogeneous population assuming MCAR
Table 3 Type-I error (%) of the TIMBD under population admixture with a moderate marker allele difference assuming MCAR/MAR
Table 4 Type-I error (%) of the TIMBD under population admixture with an extreme marker allele difference assuming MCAR/MAR
Table 5 Power (%) of the TIMBD assuming no linkage ( θ =0.5) or association ( δ =0)
Table 6 Power (%) of the TIMBD assuming linkage (θ=0) and association (δ=0.1)
Table 7 GWAS scenarios: power (%) of the TIMBD after Bonferroni’s correction for multiple testing (α=5 × 10⁻⁸) assuming no linkage (θ=0.5) or association (δ=0)

Results

Type-I error

Table 2 displays the type-I errors of the TIMBD in a homogeneous population when the missing pattern was MCAR for each member of the triads. The disease and marker allele frequencies were 0.3 and 0.4, respectively. The disease penetrance and phenocopy rate were 0.4 and 0.2, respectively. Different disease and marker allele frequencies, penetrance rates and phenocopy rates yielded similar results, which are not shown in the tables. The underlying disease model is indicated in the first column. The second and third columns are the recombination fraction (θ) and disequilibrium coefficient (δ). The fourth column gives, in parentheses, the missing rates for the father, mother and offspring, in that order. The three missing rates may differ from one another. However, each of the three missing rates was identical for all genotypes, B1B1, B1B2 and B2B2, such that the missing pattern was MCAR for each family member. When there was no linkage (θ=0.5) or association (δ=0), the TDT and 1-TDT showed the expected 5% chance of rejecting the null hypothesis in the upper nine rows of Table 2. When there was linkage (θ=0) and association (δ=0.14), power of the TDT and 1-TDT is displayed in the bottom nine rows of Table 2. Simulation results indicated that type-I errors of the TIM and TIMBD were less than the nominal level of 5%, regardless of the relationship between the marker and the disease alleles. Therefore, test statistics of the TIM and TIMBD were independent of the recombination fraction θ and disequilibrium coefficient δ.

When the parental and offspring genotypes were MCAR or MAR in an admixed population, type-I errors of the TIM and TIMBD are displayed in Tables 3 and 4. The disease penetrance and phenocopy rates were 0.4 and 0.2, respectively, which implies a genotype relative risk of 2. Higher or lower genotype relative risks yielded similar comparisons and the results are not shown. In Tables 3 and 4, the disease allele frequencies of the first and second populations were 0.2 and 0.6, respectively. Therefore, the degree of admixture was identical in Tables 3 and 4. However, in Table 3 (Table 4), the minor marker allele frequencies for the first and second populations were 0.4 and 0.3 (0.6 and 0.2), respectively. Hence, the difference between the marker allele frequencies of the two populations was more extreme in Table 4 than in Table 3. When the disease and marker allele frequencies were <0.2 or >0.6, the comparisons between the TIM and TIMBD were similar and the results are not shown.

Since the TDT and 1-TDT were robust to population stratifications, both methods demonstrated the expected 5% type-I errors when there was no linkage or association in the upper nine rows of Tables 3 and 4. Under the alternative hypothesis, the TDT showed the lowest power due to exclusions of dyads in the analysis in the bottom nine rows of Tables 3 and 4. Therefore, if the missing pattern was MCAR/MAR, then the 1-TDT was more powerful than the TDT for detecting LD, which matched previous reports by Sun et al.5 and Guo et al.11 Although the TIM performed well under population admixture and showed type-I errors <5% in Table 3, its type-I errors could be slightly inflated over 8% in rows 3, 6, 9, 12, 15 and 18 of Table 4. In both scenarios, type-I errors of the TIMBD did not exceed 5%, although it appeared conservative. Therefore, the simulation results revealed that the TIMBD was robust to population admixture, while the TIM may suffer slightly inflated type-I errors.

Notes on the Breslow-Day test18 indicate that it requires large sample sizes within strata and that it becomes conservative under ‘small stratum’ settings. In the simulations, the sample size of the EM-HRR (i.e., the missing-data stratum) was kept modest to reflect real-life scenarios, in which the proportion of missing data is not too high. Therefore, the type-I error of the TIMBD was slightly conservative. Note that the average marker allele frequency in Table 3 was higher than that in Table 4. As a result, the type-I error of the TIMBD decreased from Table 3 to Table 4. In other words, decreasing marker allele frequencies made the type-I error of the TIMBD more conservative, and this pattern matched the previous results.18 Regardless, the TIMBD did not yield an inflated type-I error and remained a valid test, even if the sample sizes in some strata were small.

Power

Simulation results displayed in Table 5 (no association (δ=0) or linkage (θ=0.5)) and in Table 6 (association (δ=0.1) and linkage (θ=0)) were circumstances under which genotypes of triads were missing informatively in a homogeneous population. The disease and marker allele frequencies were 0.3 and 0.4, respectively. The disease penetrance and phenocopy rate were 0.4 and 0.2, respectively. The following two scenarios were examined: (1) the odd rows in each disease model: informative missingness occurred solely in parents but not the offspring. One can see that the missing rate (15 or 10%) was identical for any offspring genotype; (2) the even rows in each disease model: informative missingness occurred in both offspring and parental genotypes.

In Table 5, the TDT using the subset of complete triads remained a valid test for LD under the first scenario (the odd rows), since the TDT revealed type-I errors approaching the 5% nominal level. However, the 1-TDT, which used both triads and dyads, showed inflated type-I errors, and the inflation increased with the magnitude of informative missingness. Under the second scenario, the TDT and 1-TDT were no longer valid tests, but the 1-TDT was less inflated than the TDT. In either scenario, the TIMBD was consistently more powerful than the TIM and the difference was more discernible when informative missingness was introduced by the minor allele (B1)/genotype (B1B1) (rows 3, 4, 7 and 8 in each disease model).

In Table 6, power of the 1-TDT was lower (higher) than the TDT, when the major (minor) genotype introduced the informative missingness. This fact suggested that including dyads in the analysis could either dampen or inflate power of the 1-TDT when the assumption of MCAR/MAR was violated, which matched the previous investigations.19, 20 The results revealed an important message that informative missingness could also prevent discoveries of putative disease genes.

The GWAS scenarios assuming no linkage or association are displayed in Table 7. The results were adjusted by Bonferroni’s correction for multiple testing (the adjusted α=5 × 10⁻⁸). The TIMBD demonstrated decent power in the GWAS scenarios. The results also revealed that the TDT could yield considerable false positives (the second row in each disease model), even if the correction for multiple testing was implemented. This phenomenon illustrated the relationship between excessive false positives and informative missingness in the GWAS analysis.

Discussion

Unlike the TIM, which is conditional on the offspring genotypes, the novel strategy TIMBD detects informative missingness through inconsistent LD signals between the complete and incomplete data. Owing to its family-based design, the TIMBD is robust to population stratification and outperforms the TIM in most situations. Excessive false positives solely due to informative missingness were also observed in the GWAS scenarios. The TIMBD is applicable to general pedigrees, when independent triads, dyads and monads are identified from the independent pedigrees (see Supplementary data for the application in SAS/STAT software, SAS Institute Inc., Cary, NC, USA).

In addition to non-random genotyping failure, which introduces informative missingness in both the offspring and parents, informative missingness may occur due to death or refusal to participate related to the outcome. One example to consider is asthma,21, 22 which could be diagnosed in both children and adults. The other plausible scenario is informative missingness in the parents, but not the offspring, as seen in age-dependent diseases, such as cancer,23 Parkinson’s disease,24 diabetes25 and cardiovascular diseases.26, 27 As with the TIM, the limitation of the TIMBD is that it cannot detect informative missingness that exists solely in the offspring, but not the parents, which could be classified as ascertainment bias. However, the TIMBD could serve as a foundation and/or stepping stone for considering ascertainment bias in genetic studies, since it can determine whether or not parental genotypes are missing informatively.

It is worth noting that the HRR and EM-HRR are based on 2 × 2 contingency tables; hence, the TIMBD could be easily extended into the logistic regression framework by adopting the Breslow-Day test in the logit model. In this way, the TIMBD could adjust for covariates related to missingness and ensure a valid test under various conditions of MAR as discussed previously.14