Introduction

Genomic imprinting and maternal effects are both important epigenetic factors that have been explored as potential sources of heritability unexplained by genome-wide association studies. Both are involved in many complex human diseases, including Prader–Willi, Beckwith–Wiedemann, and Angelman syndromes [1], and childhood cancers [2]. In this paper, we focus on extending a statistical method so that it is more powerful for detecting imprinting and maternal effects; we therefore describe these two epigenetic factors only briefly, to the extent needed to explain our work. Complete and detailed descriptions of their complex mechanisms can be found in the literature [3,4,5]. Genomic imprinting refers to the process of differential epigenetic DNA modifications of the parental alleles, which leads to unequal expression of a heterozygous genotype depending on whether the variant allele is inherited from the mother or from the father [3]. Genomic imprinting may vary among individuals (a phenomenon termed polymorphic imprinting) depending on the stages of development, tissues, genetic background, and environment [4]. A maternal genotype effect, on the other hand, is a phenomenon wherein the phenotype of an individual is influenced by the genotype of the mother, not merely by the allelic copy inherited. Maternal effects usually occur due to additional mRNAs or proteins that are passed from the mother to the fetus during pregnancy and remain in the child after birth. Although imprinting and maternal effects arise from two different biological processes, their effects can mask one another [5]. Thus, it is important that these two confounding effects be considered together in statistical methods to avoid false positives/negatives ([6] and references therein).

A number of statistical methods have been proposed to study imprinting and maternal effects jointly, dating back to the work of Weinberg et al. [7]. Nevertheless, this and most of the work that followed assume mating symmetry to avoid overparametrization, so that hypothesis testing on the existence of imprinting and/or maternal genotype effects can be carried out [6]. The most frequently used design for studying maternal and imprinting effects is that of complete triads, where the genotypes of mother, father, and child must all be observed [8] to facilitate the identification of the parental origin of the allele of interest. In practice, incomplete triads are common, as one parent may be unavailable for genetic study: parents may be deceased or otherwise unavailable; they may refuse to participate in the study; or the father may need to be excluded post hoc due to nonpaternity. Although the need for complete triad data has been relaxed in more recent methods so that mother–child pairs may also be included in a study, the data are nonetheless not fully utilized [9,10,11].

Among the methods that allow for missing father genotypes, LIME, a partial Likelihood inference method for detecting both imprinting and maternal effects, stands out as it does not require assumptions about mating type probabilities [11]. Furthermore, nonproband siblings may also be included in a study in addition to the child proband in triads and pairs. However, when a father’s genotype is missing, LIME has to ignore pairs in which both the mother’s and the child’s genotypes are heterozygous, as the parental origin of the child’s variant allele cannot be determined and thus the parameters of interest and the nuisance parameters are not separable.

While Yang and Lin [11] demonstrated that ignoring mother–child pairs in which both genotypes are heterozygous may not lead to a significant loss of information in some scenarios for LIME, incorporating genotype information for markers that are in linkage disequilibrium (LD) with the test locus can generally help infer parental origin [12,13,14]. As such, we propose LIMEhap, an approach that adopts the methodology of LIME but utilizes additional information from nearby markers that are in LD with the test locus to help infer missing genotypes. LIMEhap can therefore detect both imprinting and maternal effects, and distinguish between them, by using all the complete triad data and incomplete mother–child pair data when available. To substantiate the validity of the proposed procedure and to quantify the potential gain in power, we carried out an extensive simulation study covering 32 settings. Our results show that, by inferring the parental origin of the minor allele at the test locus through borrowing genetic information from nearby loci, the power of the proposed method can be close to that obtained using complete genotype data at the test locus, with a well-controlled type I error rate. This illustrates that the use of nearby markers in LD with the test locus helps resolve parental origin ambiguity to a great extent, consistent with observations in the literature. We further study the robustness of LIMEhap to violation of Hardy–Weinberg Equilibrium (HWE), as inference on the genotype distribution is made under the assumption of HWE. Finally, we applied LIMEhap to an autism spectrum disorder dataset to demonstrate its practical utility.

Materials and methods

Enrichment of test locus information

Suppose a disease locus has two alleles A and a, where a (typically the minor allele, that is, the one with a smaller frequency than A) is the variant allele. In a nuclear family, we use F, M, and C to represent the genotypes of the father, the mother, and a child, respectively. The genotypes AA, Aa, and aa are coded as 0, 1, and 2, respectively, representing the number of copies of the variant allele. We also use \(1_m\) to indicate a child with genotype Aa whose minor allele a is inherited from the mother.

We use a multiplicative risk model for the disease penetrance:

$$P(D \,=\, 1|M,F,C) =\, \, \delta \,R_1^{I\left( {C \,=\, 1} \right)}R_2^{I\left( {C \,=\, 2} \right)}R_{im}^{I\left( {C \,=\, 1_m} \right)}\\ S_1^{I\left( {M \,=\, 1} \right)}S_2^{I\left( {M \,=\, 2} \right)},$$
(1)

where δ is the phenocopy rate; \(R_1\) and \(R_2\) denote the effects of one or two copies of an individual’s own variant allele, respectively; \(R_{im}\) denotes the imprinting effect; \(S_1\) and \(S_2\) denote the effects of one or two copies of the mother’s variant allele, respectively; and finally, D is an indicator variable denoting the disease status of the child, with 1 being affected and 0 being unaffected. As shown in Yang and Lin [11], all the parameters are identifiable and estimable; therefore, we can distinguish maternal effects from imprinting effects using the model as specified in (1). We denote the vector of parameters of interest by \(\theta \,=\, \left( {\delta ,R_1,R_2,R_{im},S_1,S_2} \right)^{\it{ \top }}\) and the vector of nuisance parameters (including mating type probabilities, to be elaborated in the following) by ϕ. We note that this model is assumed to apply to all individuals; therefore, the methodology is expected to suffer from power loss in the presence of polymorphic imprinting.
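To make the structure of model (1) concrete, the following minimal Python sketch evaluates the penetrance for a given mother–child configuration. The function name and its arguments are hypothetical and are not part of any LIME or LIMEhap software; the father's genotype enters only through whether the child's variant allele is of maternal origin, which is passed here as a flag.

```python
def penetrance(m, c, maternal_origin, delta, r1, r2, r_im, s1, s2):
    """P(D = 1 | M, F, C) under the multiplicative model (1).

    m, c            : variant-allele counts (0, 1, 2) for mother and child
    maternal_origin : True if the child is heterozygous and inherited the
                      variant allele from the mother (C = 1_m); ignored
                      when c != 1
    """
    risk = delta                      # phenocopy rate
    if c == 1:
        risk *= r1                    # one copy of the child's own variant
        if maternal_origin:
            risk *= r_im              # imprinting factor applies to C = 1_m
    elif c == 2:
        risk *= r2                    # two copies of the child's own variant
    if m == 1:
        risk *= s1                    # maternal effect, one copy
    elif m == 2:
        risk *= s2                    # maternal effect, two copies
    return risk
```

For example, `penetrance(1, 1, True, delta, r1, r2, r_im, s1, s2)` multiplies δ by R_1, R_im, and S_1, whereas the same call with `maternal_origin=False` omits the R_im factor.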

When the father’s genotype is missing, there are seven genotype combinations of the (M,C) pair, listed in Table 1. Therein, the μ’s are mating type probabilities; that is, \(\mu_{ij} = P(M = i, F = j)\), i, j = 0, 1, 2. As we can see from the table, for the first six categories, the mating type probabilities (the nuisance parameters) and the risk parameters can be completely separated as multiplicative factors in the joint probabilities of mother–child genotypes and child’s disease status (either D = 1 or D = 0). However, for the 7th category, (M,C) = (1,1), separation cannot be achieved. This is because when both mother and child are heterozygous, the parental origin of the child’s variant allele cannot be inferred when the father’s genotype is missing. As such, LIME ignores data from this category, with a justification given in Yang and Lin [11], who demonstrated that the loss of information from ignoring this category is not substantial in some situations. Nevertheless, it is important to recover as much information as possible, especially given the scarcity of family data compared with population case-control samples. For brevity in the main text, a detailed description of the LIME procedure is relegated to Supplementary Material S1. In the remainder of this subsection, we describe a method that uses haplotype information in the genomic region spanned by the test locus and nearby markers in LD with it to impute the missing father’s genotype so that such mother–child pairs can also be used.

Table 1 Joint probabilities of disease status and genotypes, with further expansion of the (M,C) = (1, 1) category and sample size for each category.

For ease of exposition, we explain the idea in a simple scenario with only two loci. The method for multiple loci proceeds analogously. Suppose the test locus has two alleles, A and a, and a nearby marker also has two alleles, B and b. We assume that these two loci are in high LD so that there is no recombination between them when the chromosomal segment is transmitted from parents to offspring. There are four possible haplotypes formed by these two loci: AB, Ab, aB, and ab. Suppose that, in a triad, the mother’s unphased genotype is (Aa,Bb), where the first entry in the parentheses denotes the genotype at the test locus and the second denotes the genotype at the marker locus. For the child, the genotype is (Aa,BB). Then at the test locus, (M,C) = (1,1). If, further, haplotype ab occurs with a negligible frequency, then one can safely assume that the mother’s phased two-locus genotype is Ab|aB, where the haplotypes before and after the “|” make up the haplotype pair of the mother. Since no recombination is allowed between these two loci, she must have passed aB to the child, because the child has genotype BB at the additional marker locus. Therefore, the a allele in the child must have been inherited from the mother.
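The two-locus argument can be checked by brute-force enumeration. The sketch below is only an illustration of the reasoning, not of HAPLORE: it lists the phasings compatible with each unphased genotype, optionally excludes haplotypes assumed to have negligible frequency, and reports which parent could have contributed the child's a allele under no recombination. All function names are hypothetical.

```python
def phasings(test_gt, marker_gt, excluded=frozenset()):
    """Unordered haplotype pairs compatible with an unphased two-locus
    genotype, e.g. test_gt="Aa", marker_gt="Bb"."""
    pairs = set()
    for t1, t2 in ((test_gt[0], test_gt[1]), (test_gt[1], test_gt[0])):
        for k1, k2 in ((marker_gt[0], marker_gt[1]), (marker_gt[1], marker_gt[0])):
            h1, h2 = t1 + k1, t2 + k2
            if h1 in excluded or h2 in excluded:
                continue  # skip phasings that need a negligible haplotype
            pairs.add(tuple(sorted((h1, h2))))
    return pairs


def origins_of_a(mother, child, excluded=frozenset()):
    """Possible parental origins of the a allele in a heterozygous (Aa)
    child, given the mother's and child's unphased genotypes and no
    recombination between the two loci. The father is unobserved."""
    origins = set()
    for m_pair in phasings(mother[0], mother[1], excluded):
        for c_pair in phasings(child[0], child[1], excluded):
            for i in (0, 1):
                from_mother = c_pair[i]
                if from_mother not in m_pair:
                    continue  # mother cannot transmit this haplotype intact
                # The child is Aa, so exactly one haplotype carries a.
                origins.add("mother" if from_mother[0] == "a" else "father")
    return origins


# Mother (Aa,Bb), child (Aa,BB), haplotype ab assumed negligible.
resolved = origins_of_a(("Aa", "Bb"), ("Aa", "BB"), excluded={"ab"})
# Same family when all four haplotypes are allowed.
ambiguous = origins_of_a(("Aa", "Bb"), ("Aa", "BB"))
```

With ab excluded, only the maternal origin remains; when all four haplotypes are allowed, both origins are compatible with the data, which is the situation that calls for a probabilistic treatment.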

In a more complex scenario, we assume that all four haplotypes have frequencies that are too high to be ignored. In this situation, we still cannot determine the parental origin of the a allele even with additional information from the marker locus. For such cases, we will infer all compatible haplotype configurations for the family and compute their corresponding probabilities, as shown schematically in Table S1 and explained in the caption. By doing so, methods that cannot make use of the pair data with (M,C) = (1,1) at the test locus, such as LIME, can now borrow information from nearby markers to estimate the distribution of the familial genotype configurations, which provides a probabilistic approach to incorporate the partial information into these methods.

LIMEhap

In this section, we introduce LIMEhap, an extension of LIME, to make use of the (M,C) = (1,1) category based on inferred haplotype distributions. Pairs with (M,C) = (1,1) are processed first. Utilizing information from loci in LD with the test locus and assuming that there are no recombinations in the region spanned by these loci, we infer compatible haplotype configurations and their corresponding probabilities for the nuclear family (which may include additional siblings) using HAPLORE [15]. Since HWE is a necessary assumption in HAPLORE, we will explore the robustness of LIMEhap when the HWE assumption is violated.

Define \(n_{mc}^1\) and \(n_{mc}^0\) as the numbers of case-mother and control-mother pairs, respectively, with M = m and C = c. The child in each of these pairs is a proband (either affected with D = 1 or unaffected with D = 0). Define \(sn_{mc}^1\) and \(sn_{mc}^0\) similarly for sibling-mother pairs, where these siblings are not probands. We assume that the total number of families derived from case-mother pairs, \(n_p^1 \,=\, \mathop {\sum}\nolimits_{m,c} {n_{mc}^1}\), and the total number of families derived from control-mother pairs, \(n_p^0 \,=\, \mathop {\sum}\nolimits_{m,c} {n_{mc}^0}\), are fixed. Then the likelihood from the observed data can be written, up to a proportionality constant, as

$$L\left( {\theta ,\phi } \right) \propto \, \left\{ \mathop {\prod}\nolimits_{m,c} P\left( {m,c{\mathrm{|}}D \,=\, 1} \right)^{n_{mc}^1}P\left( {m,c{\mathrm{|}}D \,=\, 0} \right)^{n_{mc}^0} \right. \\ \left. \,\,\, P\left( {D \,=\, 1{\mathrm{|}}m,c} \right)^{sn_{mc}^1}P\left( {D \,=\, 0{\mathrm{|}}m,c} \right)^{sn_{mc}^0} \right\}.$$

Note that in the above likelihood, for each case-mother or control-mother pair, the contribution is through a conditional probability conditioning on their affection status (first two factors in the likelihood), whereas for the additional siblings, their contributions are factored into the likelihood prospectively since they are not probands (last two factors in the likelihood).

Following the argument of LIME, one can extract the following partial likelihood that is free of the nuisance parameters:

$$L_p\left( \theta \right) \,=\, \left\{ \mathop {\prod}\nolimits_{m,c} {\left( {p_{mc}} \right)^{n_{mc}^1}\left( {1 \,-\, p_{mc}} \right)^{n_{mc}^0}} \right\} \\ \left\{ {\mathop {\prod}\nolimits_{m,c} {\left( {q_{mc}} \right)^{sn_{mc}^1}\left( {1 \,-\, q_{mc}} \right)^{sn_{mc}^0}} } \right\}, $$
(2)

where \(p_{mc}\) and \(q_{mc}\) for (m,c) ≠ (1,1) are as described in Supplementary Material S1, and we focus on defining these two quantities for (m,c) = (1,1) in the following. We first note that in the original LIME method [11], the likelihood and partial likelihood do not include the (m,c) = (1,1) category. Nevertheless, the argument for extracting the partial likelihood, now including all pairs, remains valid, although the corresponding \(p_{11}\) and \(q_{11}\) must be defined differently, as explained in the following.
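As a reading aid, the logarithm of the partial likelihood (2) can be evaluated with a few lines of Python once the category probabilities are available. This is a minimal sketch under the assumption that the probabilities and counts are supplied as dictionaries keyed by (m, c); the function and argument names are hypothetical, and the counts may be fractional when they are sums of inferred probabilities.

```python
import math


def log_partial_likelihood(p, q, n_case, n_control, sn_aff, sn_unaff):
    """Log of the partial likelihood (2).

    p, q      : dicts mapping (m, c) to p_mc and q_mc, each strictly in (0, 1)
    n_case    : dict of case-mother pair counts n^1_mc
    n_control : dict of control-mother pair counts n^0_mc
    sn_aff    : dict of affected nonproband sibling counts sn^1_mc
    sn_unaff  : dict of unaffected nonproband sibling counts sn^0_mc
    Categories missing from a count dict contribute zero.
    """
    total = 0.0
    for key, p_mc in p.items():
        total += n_case.get(key, 0) * math.log(p_mc)
        total += n_control.get(key, 0) * math.log(1.0 - p_mc)
    for key, q_mc in q.items():
        total += sn_aff.get(key, 0) * math.log(q_mc)
        total += sn_unaff.get(key, 0) * math.log(1.0 - q_mc)
    return total
```

In practice the values of p and q would be recomputed from θ at every step of a numerical optimizer; that dependence is omitted here to keep the sketch short.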

We let \({\it{n}}_{11} = n_{11}^1 + n_{11}^0\) be the total number of nuclear families with (m,c) = (1,1), and let \(n_{1f1h}^1\) and \(n_{1f1h}^0\) be, respectively, the sums of the inferred probabilities of case-mother and control-mother pairs with triad genotype configuration (m,f,c) = (1,f,1) at the test locus, derived from HAPLORE as explained in the subsection “Enrichment of test locus information”, where f may take the value 0, 1, or 2. Following the original idea of LIME [11], we use \(p_{1f1}(\theta)\) to represent the probability of being a case-parent triad among all families (including all case and control families) with (m,f,c) = (1,f,1), f = 0, 1, 2. It is defined as

$$p_{1f1}\left( \theta \right) = \,\, \frac{1}{{s_{1f1}}}\frac{{n_p^1P\left( {D \,=\, 1|M \,=\, 1,F \,=\, f,C \,=\, 1} \right)}}{{P(D \,=\, 1)}},\\ f \,= \; 0,1,2,$$

where

$$s_{1f1}(\theta ) =\, \, \frac{{n_p^1P\left( {D \,=\, 1|M \,=\, 1,F \,=\, f,C \,=\, 1} \right)}}{{P\left( {D \,=\, 1} \right)}}\\ \, +\, \frac{{n_p^0P\left( {D \,=\, 0|M \,=\, 1,F \,=\, f,C \,=\, 1} \right)}}{{P\left( {D \,=\, 0} \right)}}$$

is also free of the nuisance parameters. Then, the probability of being a case-mother pair among families with (m,c) = (1,1), \(p_{11}(\theta)\), is the weighted average of the \(p_{1f1}\)’s, with the weight for each f (= 0, 1, 2) being the proportion of families with (m,f,c) = (1,f,1): \(\left( {n_{1f1h}^1 \,+\, n_{1f1h}^0} \right)/n_{11}\). That is,

$$p_{11}\left( \theta \right) \,=\, \,\,p_{101}\frac{{n_{101h}^1 \,+\, n_{101h}^0}}{{n_{11}}} \,+\, p_{111}\frac{{n_{111h}^1 \,+\, n_{111h}^0}}{{n_{11}}}\\ +\, p_{121}\frac{{n_{121h}^1 \,+\, n_{121h}^0}}{{n_{11}}}.$$
(3)

Similarly, we define \(sn_{1f1h}^1\) and \(sn_{1f1h}^0\) as, respectively, the sums of the inferred probabilities of affected and unaffected sibling-mother pairs with triad genotype configuration (m,f,c) = (1,f,1) at the test locus, derived from HAPLORE. Then the penetrance probability for (m,c) = (1,1) is calculated as the weighted average of the disease penetrance \(q_{1f1}(\theta) = P(D = 1|M = 1, F = f, C = 1)\), with the weight for each f (= 0, 1, 2) being the proportion of the additional siblings with (m,f,c) = (1,f,1): \((sn_{1f1h}^1 \,+\, sn_{1f1h}^0)/sn_{11}\), where \(sn_{11}\), defined analogously to \(n_{11}\), is the total number of additional siblings with (m,c) = (1,1). That is,

$$q_{11}(\theta ) =\, \, q_{101}\frac{{sn_{101h}^1 \,+\, sn_{101h}^0}}{{sn_{11}}} \,+\, q_{111}\frac{{sn_{111h}^1 \,+\, sn_{111h}^0}}{{sn_{11}}} \\ +\, q_{121}\frac{{sn_{121h}^1 \,+\, sn_{121h}^0}}{{sn_{11}}}.$$

It is worth pointing out that the above idea for handling the (M,C) = (1,1) category is in fact parallel to the way that LIME handles the (M,F,C) = (1,1,1) category, where the uncertain origin of the a allele is attributed to the father or the mother, each with probability 0.5. Thus, the disease penetrance is a weighted average of the two possibilities, with the weights being their respective probabilities (50% for each). For LIMEhap here, we have three possibilities, and we use the inferred probabilities from HAPLORE as the weights.
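The weighted average in (3) can be sketched numerically as follows. Several points are assumptions made for illustration rather than statements about the LIMEhap implementation: the function names are hypothetical; P(D = 1) is supplied as a known prevalence; the penetrance for f = 0 uses maternal origin and for f = 2 paternal origin (which follows from Mendelian transmission when mother and child are both Aa); the penetrance for f = 1 is the 50/50 average described above; and the weights are assumed to sum to n_11.

```python
def penetrance_1f1(f, delta, r1, r_im, s1):
    """P(D = 1 | M = 1, F = f, C = 1) under model (1)."""
    paternal = delta * r1 * s1          # child's a allele came from the father
    maternal = paternal * r_im          # child's a allele came from the mother
    if f == 0:
        return maternal                 # father AA: a must be maternal
    if f == 2:
        return paternal                 # father aa: a must be paternal
    return 0.5 * maternal + 0.5 * paternal  # father Aa: origin unknown


def p_1f1(pen, n_case_total, n_control_total, prevalence):
    """Probability of being a case family given (m, f, c) = (1, f, 1)."""
    case_term = n_case_total * pen / prevalence
    control_term = n_control_total * (1.0 - pen) / (1.0 - prevalence)
    return case_term / (case_term + control_term)   # case_term / s_1f1


def p_11(weights, delta, r1, r_im, s1, n_case_total, n_control_total, prevalence):
    """Equation (3): weights[f] = n^1_{1f1h} + n^0_{1f1h} for f = 0, 1, 2."""
    n_11 = sum(weights)
    total = 0.0
    for f in (0, 1, 2):
        pen = penetrance_1f1(f, delta, r1, r_im, s1)
        total += p_1f1(pen, n_case_total, n_control_total, prevalence) * weights[f] / n_11
    return total
```

The same weighting applied to `penetrance_1f1` directly, with sibling-based weights, gives q_11. Under a null model with all relative risks equal to 1, a balanced design, and δ equal to the prevalence, `p_11` reduces to 0.5 regardless of the weights, which is a convenient sanity check.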

Simulation study

We carry out a simulation study to evaluate the performance of LIMEhap and compare its information gain over LIME. To generate data that reflect LD structure in real situations, we consider 5 SNPs in the GPX1 gene (shown in Supplementary Table S2), where the underlying haplotypes have been previously constructed and used in other studies, including those that investigated the properties of methods for detecting parent-of-origin effects [13, 16, 17]. We consider four scenarios, which are combinations of two levels of population disease prevalence P(D = 1) (PREV) {0.05, 0.15}, and two levels of HWE {not hold = 0, hold = 1}. The probabilities of a genotype taking values of 0, 1, and 2 are \((1 - p)^2(1 - \zeta) + (1 - p)\zeta\), \(2p(1 - p)(1 - \zeta)\), and \(p^2(1 - \zeta) + p\zeta\), respectively, where ζ is the inbreeding parameter [18]. When HWE holds, ζ = 0. When HWE does not hold, ζ is set to be 0.1 and 0.3 for males and females, respectively.
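The genotype distribution with an inbreeding parameter can be written as a short Python function. This is a minimal sketch; the function name is hypothetical, and p is read here as the frequency of the variant allele, which is how it enters the formulas above.

```python
def genotype_probs(p, zeta):
    """Probabilities of genotypes 0, 1, 2 (copies of the variant allele)
    for variant-allele frequency p and inbreeding parameter zeta.
    zeta = 0 recovers Hardy-Weinberg proportions."""
    q = 1.0 - p
    p0 = q * q * (1.0 - zeta) + q * zeta
    p1 = 2.0 * p * q * (1.0 - zeta)
    p2 = p * p * (1.0 - zeta) + p * zeta
    return p0, p1, p2


hwe = genotype_probs(0.3, 0.0)       # HWE holds
males = genotype_probs(0.3, 0.1)     # zeta used for males when HWE fails
females = genotype_probs(0.3, 0.3)   # zeta used for females when HWE fails
```

A positive ζ shifts probability mass from the heterozygous genotype to the two homozygous genotypes while leaving the allele frequency unchanged.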

To facilitate this investigation, we consider a total of eight disease models (Table 2). The first three models portray neither imprinting nor maternal effects. Model 4 has a maternal effect only, models 5 and 6 have an imprinting effect only, and models 7 and 8 have both parent-of-origin effects. With the specification of each scenario and a disease model, the penetrance probability in (1) is fully specified. In all, we consider a total of 32 settings (four scenarios and eight disease models).

Table 2 Parameters for eight disease models used in the simulation study.

We generate 1000 replicates under each of these 32 settings. Each replicate consists of 150 case families and 150 control families. First, parental haplotypes are generated. Then, the haplotypes of their proband children are created according to the transmission probability, assuming no recombination. The affection status D of each proband is determined by a Bernoulli trial, with the success probability calculated based on (1) with M, F, and C being the mother’s, father’s, and child’s genotypes at the test locus, which are deduced completely from the generated haplotypes. A family with an affected child is recruited as a case family, whereas a family with an unaffected child is recruited as a control family. The process of generating M, F, C, and D is repeated until we have collected sufficient numbers of case and control families to meet the preset sample size. This process returns triad data with affected or unaffected probands. To generate pair data, we remove all the fathers’ genotype information from the triad data. In addition, we also consider the scenario with one additional nonproband sibling, whose haplotypes are generated based on the parents’ haplotypes and transmission probability as for the probands, and whose disease status is assigned prospectively based on (1). We denote by P and T the pair data without the (M,C) = (1,1) category and the triad data, respectively. We further denote by P + 1 and T + 1 the pair and triad families, each with one additional sibling. Finally, we use P + hap and P + 1 + hap to represent the corresponding pair data that include the (M,C) = (1,1) pairs.
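The recruitment scheme can be illustrated with a simplified Python sketch. It departs from the simulation described above in ways that should be kept in mind: it generates a single biallelic test locus rather than 5-SNP haplotypes, it draws parental genotypes under HWE, and it omits the nonproband sibling. The function names and the dictionary layout are hypothetical.

```python
import random


def simulate_families(n_case, n_control, p, theta, seed=0):
    """Rejection-sample case and control triads at one biallelic locus.

    p     : variant-allele frequency
    theta : (delta, r1, r2, r_im, s1, s2) as in model (1); values must keep
            every penetrance strictly between 0 and 1 so the loop terminates
    """
    delta, r1, r2, r_im, s1, s2 = theta
    rng = random.Random(seed)

    def parent():
        # Two independent alleles (1 = variant a), i.e. HWE in the parents.
        return (int(rng.random() < p), int(rng.random() < p))

    cases, controls = [], []
    while len(cases) < n_case or len(controls) < n_control:
        mother, father = parent(), parent()
        from_m, from_f = rng.choice(mother), rng.choice(father)  # Mendelian transmission
        m, f, c = sum(mother), sum(father), from_m + from_f
        risk = delta * (1.0, s1, s2)[m]          # maternal effect
        if c == 1:
            risk *= r1 * (r_im if from_m == 1 else 1.0)  # imprinting if a is maternal
        elif c == 2:
            risk *= r2
        d = int(rng.random() < risk)             # Bernoulli trial for affection status
        family = {"M": m, "F": f, "C": c, "D": d}
        if d == 1 and len(cases) < n_case:
            cases.append(family)
        elif d == 0 and len(controls) < n_control:
            controls.append(family)
    return cases, controls


def to_pairs(families):
    """Drop the father's genotype to turn triads into mother-child pairs."""
    return [{k: v for k, v in fam.items() if k != "F"} for fam in families]
```

A call such as `simulate_families(150, 150, p=0.3, theta=(0.05, 1.0, 1.0, 3.0, 1.0, 1.0))` would produce one replicate of the balanced design; the parameter values in this call are placeholders and are not taken from Table 2.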

Real data analysis

Autism spectrum disorder (ASD) is a serious neurodevelopmental disorder that impairs the ability to communicate and interact, typically starting in childhood [19, 20]. Parents usually notice signs in the first two years of their child’s life. The disease is highly heritable, but the underlying genetic determinants are not yet fully understood [21, 22]. Some studies have amassed evidence that suggests the involvement of parent-of-origin effects, including both imprinting and maternal effects [23,24,25]. To uncover the genetic architecture of ASD, the Autism Genome Project (AGP) Consortium investigated 2611 nuclear families. We obtained the genotype and phenotype data of 4222 individuals with pedigree information from dbGaP (Accession: phs000267.v1.p1) [26]. In our application of LIMEhap, we focus on the 41,940 SNPs on chromosome 7, as this chromosome contains more than 50 genes that have been implicated in ASD in the literature. One flanking marker on each side was used for each test locus. In addition, as ignoring structure due to differential ancestry can lead to an excess of spurious findings and reduced power, we only utilize families with European ancestry (the indicator variable for European ancestry is available). About 88% of the individuals in the data belong to this category; they come from 1193 nuclear families including 1177 affected and 32 unaffected offspring.

Results

Simulation study

Type I error and power

We applied LIMEhap to data types P + hap and P + 1 + hap. For comparison, we also applied LIME to the P, P + 1, T, and T + 1 data types. Although it is expected that the results from LIMEhap are somewhere between those for the corresponding P and T data, we are interested in assessing how much information can be gained using P + hap over P, or P + 1 + hap over P + 1, and how close P + hap or P + 1 + hap can get to T or T + 1, respectively. For our first set of analyses, we use SNP1 as the test locus and the rest of the SNPs to infer haplotypes. Figure 1 shows the type I error rate and power when HWE does not hold and PREV = 0.05. We can see that all the type I error rates are close to the nominal level of 0.05, even when HWE does not hold. This shows that even though HAPLORE infers haplotypes based on the HWE assumption, LIMEhap is still robust to violation of this assumption, inheriting the property from the original LIME. Compared with LIME, LIMEhap obtains much greater power for detecting the imprinting effect for all four non-null models and comparable power for detecting association and maternal effects. The numerical values for the type I error and power are given in Supplementary Tables S4–9. In all cases, for detecting imprinting effects, the inclusion of the additional information from the (M,C) = (1,1) category cuts down the information loss (from not having complete triad data) by at least half. The reason for this drastic increase in power for detecting the imprinting effect is that using markers in LD with the test locus helps resolve, to a great extent even if not completely, the parental origin of the variant allele in the child. The information increase is nontrivial since, typically, more than 10% of the families fall into the (M,C) = (1,1) category. Similar results for type I error and power are obtained for the other three scenarios of HWE and PREV combinations and are relegated to Supplementary Figures S1–3.

Fig. 1: Type I error and power for eight disease models (labeled 1–8 corresponding to the numbering in Table 2) for the scenario in which HWE does not hold and PREV = 0.05.

The top segment gives the outcome for association, in which the results for model 1 represent type I errors while the rest are power. The middle segment provides the outcome for imprinting, in which the results for models 1–4 represent type I errors while the rest are power. The bottom segment presents the outcome for maternal effect, in which the results for models 1–3, 5–6 represent type I errors while the rest are power. For each model, the results for six types of data utilization are given in the order provided in the caption, and note that P + hap and P + 1 + hap give the analysis results using LIMEhap and the rest using the original LIME.

In reality, there might be a mixture of triad data and pair data. To simulate this scenario, we also randomly set a father’s genotype to be missing with probability 0.5 in each case-parent family and 0.7 in each control-parent family. We then apply the two-step LIMEhap procedure to the resulting mixed datasets. Results in Supplementary Figures S4–7 and Tables S10–13 show that LIMEhap also controls type I error well and gains considerable power for detecting the imprinting effect.
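The construction of a mixed triad/pair dataset amounts to masking fathers at random, with a rate that depends on the proband's disease status. A minimal sketch, assuming each family is stored as a dictionary with hypothetical keys "F" (father's genotype) and "D" (proband status):

```python
import random


def mask_fathers(families, p_missing_case=0.5, p_missing_control=0.7, seed=0):
    """Return copies of the families in which the father's genotype is set
    to None with probability 0.5 (case families) or 0.7 (control families)."""
    rng = random.Random(seed)
    masked = []
    for fam in families:
        p_miss = p_missing_case if fam["D"] == 1 else p_missing_control
        new_fam = dict(fam)              # leave the input untouched
        if rng.random() < p_miss:
            new_fam["F"] = None          # this family becomes a mother-child pair
        masked.append(new_fam)
    return masked
```

Families whose father remains observed are analyzed as triads, and the rest as pairs.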

Position of test locus

The results presented above all use the 1st SNP in the 5-SNP block as the test locus. To explore whether detection power can be substantially influenced by the position of the test locus relative to the additional loci, we also position the test locus at the 2nd, 3rd, 4th, and 5th SNP, and we use the rest of the SNPs as additional loci for inferring fathers’ genotypes. The power for detecting the imprinting effect, presented in Fig. 2, is for the scenario where HWE holds and PREV = 0.05. From the figure, we first see that using the additional markers increases power regardless of the test locus position, except when the test locus is at the last position. To understand the result when the last SNP (SNP 5) is the test locus, we first note that this SNP is the most informative, or nearly the most informative; therefore, there is limited additional information that can be gained by utilizing information from the other SNPs. Further, the last SNP is in fact in low LD with the other SNPs (Supplementary Table S3). Perhaps the most likely reason for the lack of increase in power in this instance is that its minor allele frequency (MAF) is almost equal to 0.5 (Supplementary Table S2), leaving less certainty about the familial haplotype configurations even with additional information from nearby loci. This argument is consistent with that made in another study [13]. Similar results for the other three scenarios of HWE and PREV combinations can be found in Supplementary Figures S8–10.

Fig. 2: Power for imprinting effect at different test loci when HWE holds and PREV = 0.05.

For each locus, the genotype information from the other four loci is used to help resolve the familial haplotype ambiguity. Only results from models 5–8 are presented here since there are no imprinting effects for models 1–4.

Although one might have expected the greatest information gain to occur when the test locus is in the middle of the LD block (i.e., when SNP 3 is the test locus), this is, however, not the case. Even though this may be counterintuitive, the result is in fact not surprising upon careful inspection of Fig. 2 and Supplementary Figures S8–10, Tables S2 and S3, and using the same line of argument as for SNP 5. It turns out that SNP 3 is either the most informative or nearly the most informative, on par with SNP 5; hence, as discussed before, utilizing the flanking SNPs will only gain limited additional information. Further, like SNP 5, it has low LD with the other four SNPs (Supplementary Table S3). However, in contrast to SNP 5, the MAF of SNP 3 is 0.28 (much less than 0.5); thus, using the flanking markers did lead to some, albeit limited, power gain over using pair data only. In contrast, the greatest information gain occurs when the test SNP itself is less informative and has a relatively small MAF, such as the 2nd or the 4th SNP. Using SNP 2 as an example, which has a MAF of 0.18, we examined whether additional markers, beyond the flanking ones, will necessarily lead to greater information for inferring parental origins of the minor allele. The results in Supplementary Figure S23 show that there was a very substantial power gain for detecting the imprinting effect when the flanking markers SNP 1 and SNP 3 were used (note that the LD between SNP 1 and 2 was 0.78). However, inclusion of additional loci (adding SNPs 4 and 5 one at a time) in this situation did not lead to further power gain. Taken together, we have demonstrated that the position of the test locus relative to the additional loci is not the most important factor for determining information gain. Rather, whether substantial information will be gained by using additional markers depends not only on the LD profile of the test locus with the other markers, but also on its own informativeness and its MAF. Further, it appears that the use of the two immediate flanking markers is sufficient for helping to resolve the parental origin of the minor allele in a child. Including additional markers does not seem to be necessary.

The two-step procedure of LIMEhap is practical, but note that the disease model and affection status of probands are ignored in the first step, which may lead to bias in estimating the effect size in the second step. To quantify this concern, we checked the mean value of the estimates for each parameter under different settings, and found that the empirical means are all very close to the corresponding true values, alleviating the concern regarding bias, although outliers exist (Supplementary Figures S11–18). However, note that outliers are not a problem unique to LIMEhap; rather, they are a general phenomenon in LIME approaches when the sample size is limited [27].

Our results above are all based on a balanced design, where the number of case families is the same as the number of control families. To explore the impact of an unbalanced design, we also performed a small simulation study with 210 case families and 90 control families. We can see that LIMEhap still has type I error close to the nominal value and much higher power for detecting imprinting effects than LIME (Supplementary Figures S19–22). Nevertheless, the absolute power is lower than that of a balanced design with the same total number of 300 families.

Real data analysis

Using LIMEhap, we identified a number of SNPs that have potential association, imprinting, or maternal effects at the 0.05 significance level after the Bonferroni correction. We then checked for violation of the HWE assumption at the 5% nominal level, and none of these significant SNPs failed the test. Although LIME is not susceptible to deviation from HWE, and the second step of LIMEhap has inherited this property, we nonetheless still test HWE for all markers since HWE is assumed in HAPLORE. Our results show that some of the estimated effects are very large, which could very well be due to the small number of control families [27]. Thus these results need to be further scrutinized before they can be reported confidently, and we chose only to report SNPs with reasonable effect sizes. This issue is further elaborated in the Discussion section. In the following, we focus on discussing SNPs that are found to be significant by LIMEhap (at the Bonferroni-corrected 5% level) and have also been implicated in the literature previously.
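For orientation, the Bonferroni correction can be sketched in Python as below. The number of tests is an assumption: we take it to be the 41,940 chromosome 7 SNPs mentioned in the data description, although the exact count used in the correction is not stated here. The function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error
    rate at alpha across n_tests tests."""
    return alpha / n_tests


def bonferroni_adjust(p_values, n_tests):
    """Bonferroni-corrected p values, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


N_TESTS = 41940  # assumed: all chromosome 7 SNPs, one test per SNP and effect type
threshold = bonferroni_threshold(0.05, N_TESTS)   # on the order of 1e-6
```

Under this assumption, a SNP is declared significant for a given effect only if its raw p value falls below roughly 1.2 × 10⁻⁶, or equivalently if its corrected p value is below 0.05.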

The top segment of Table 3 presents the SNPs that have been found to have potential association effects by LIMEhap and have met the above general criterion. Specifically, SNPs rs1608628 and rs10247167 both fall in the SLI4 region, which is related to specific language impairment according to OMIM (http://www.omim.org/). It has been shown in Ruser et al. [28] that impaired communication is part of the broader autism phenotype, especially among male family members. SNPs rs4729043 and rs917325 are within gene HDAC9, which has been identified as being associated with ASDs [29]. SNPs rs1978201, rs4718958, and rs2299456 are within the autism susceptibility loci AUTS1, AUTS2, and AUTS9, respectively (OMIM). SNPs rs1567277 and rs854721 are within gene PPP1R9A, which has been examined as a candidate gene for ASDs [29]. Gene PCLO, which includes SNP rs7807790, has been identified as being associated with ASDs in the Chinese Han population [30].

Table 3 SNPs inferred by LIMEhap to have potential association, imprinting, or maternal effects on ASD that have also been implicated previously in the literature.

For imprinting effects, three SNPs have been found to meet the criterion; that is, their Bonferroni-corrected p values are smaller than 0.05 and they have been previously discussed as having an effect on ASD by other investigators (middle segment of Table 3). In fact, these SNPs all fall within the SLI4 gene region. The SNPs that were found to have potential maternal effects by LIMEhap and have also been implicated in the literature are provided in the bottom segment of Table 3. To elaborate, SNPs rs3807843 and rs6958145 are both in gene ICA1, and Salyakina et al. [31] provide evidence for the involvement of ICA1 on 7p21.3 in ASDs. SNP rs17347159 falls in the gene HDAC9. SNP rs3734989 is within PLXNA4; Suda et al. [32] found decreased expression of axon-guidance proteins such as PLXNA4 in the brains of people with autism. Both SNPs rs41567 and rs2107829 are within gene PCLO. The study by Fenster and Garner [33] suggests that alterations in the expression of Piccolo or the PCLO gene could contribute to developmental disabilities and mental retardation. SNP rs11977905 is within the gene AGMO, where a rare CNV in the AGMO (TMEM195) gene has been identified with autism [34]. SNPs rs10225065 and rs105024216 are both in the autism susceptibility locus AUTS1. SNPs rs6956114 and rs12154389 are both in the autism susceptibility locus AUTS9. SNPs rs10488060, rs11981093, and rs13228314 are all in the SLI4 gene region.

Discussion

In this article, we propose a two-step approach, LIMEhap, to improve upon a recent partial likelihood inference method, LIME, for detecting imprinting and maternal effects. The improvement is to make fuller use of mother–child pair data by utilizing all pairs regardless of their genotypes instead of discarding certain pairs with ambiguous parental origin of the variant allele. Our simulation shows that LIMEhap has empirical type I error rates close to the nominal value and achieves higher power than the original LIME. Further, the position of the test locus relative to the additional loci in the LD block does not appear to be important. On the other hand, the MAF of the test locus and its informativeness do influence the extent of power gain. Moreover, there may be little additional information gain for delineating the parental origin beyond the use of the immediate flanking markers of a test locus.
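
To make the notion of ambiguous parental origin concrete, the following Python sketch classifies a mother–child pair at a single biallelic SNP using the standard Mendelian argument. It is an illustration only and not the LIMEhap procedure: the function name, the coding of genotypes as variant-allele counts, and the returned labels are all assumptions introduced here, and the text does not specify exactly which pairs the original LIME discards.

```python
def variant_origin(mother, child):
    """Parental origin of the child's variant allele at one biallelic SNP.

    Genotypes are coded as the number of variant alleles (0, 1, or 2).
    The labels returned are hypothetical names chosen for this sketch.
    """
    if mother not in (0, 1, 2) or child not in (0, 1, 2):
        raise ValueError("genotypes must be 0, 1, or 2")
    if abs(mother - child) == 2:
        # A mother cannot transmit an allele that she does not carry.
        return "mendelian_error"
    if child != 1:
        # Homozygous child: both parents transmitted the same allele,
        # so there is no parental origin to distinguish.
        return "not_applicable"
    if mother == 0:
        return "paternal"   # the variant can only have come from the father
    if mother == 2:
        return "maternal"   # the mother must have transmitted the variant
    # Mother and child are both heterozygous: with a single locus and no
    # paternal genotype, either parent could have transmitted the variant.
    return "ambiguous"
```

Under this single-locus view, only the doubly heterozygous pair is returned as "ambiguous"; this is the kind of pair for which genotypes at additional loci in the same LD block may help delineate the parental origin.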

Although the first step of LIMEhap requires the HWE assumption, our simulation results show that it is robust to HWE violation, at least for the 32 settings considered. Note that the second step of LIMEhap does not assume HWE, as no nuisance parameters are in the partial likelihood. Nevertheless, it would be advantageous for the haplotype frequency inference procedure in the first step to be free of the HWE assumption as well to ensure that the entire LIMEhap procedure is robust to such violation regardless of the underlying setting. To this end, the algorithm described in Kong et al. [35] is a potential choice, as it does not require HWE, although extended pedigrees are needed in order to infer phase information.

From the results of our analysis of the AGP data using LIMEhap, we see that the gene region SLI4 was repeatedly implicated as having potential association, imprinting, and maternal effects on ASD. Although there is an abundance of findings of association between SLI4 and ASD in the literature, little has been said about imprinting or maternal effects in previous investigations. However, although it does not directly support the evidence of epigenetic effects, according to OMIM (http://www.omim.org/) the SLI4 region is related to specific language impairment, which is part of the broader autism phenotype, especially for male family members. Notwithstanding the possibility of false positives, the considerable number of novel findings of potential maternal and imprinting effects is likely the consequence of increased power from making use of all available data and from considering the joint effects of both factors to diminish the impact of potential confounding. These are advantages of LIMEhap compared with methods used in previous studies.

Despite the advantages seen in both simulation and real data analyses, LIMEhap has its limitations. Because of the concern about the impact of an unbalanced design on type I error and power, we carried out a small simulation, which showed that type I error rates remained well controlled but that power was lower than with a balanced design. It is worth pointing out, though, that lower power with an unbalanced design is a common problem in statistical methods for case-control (family) studies. In real data analyses, deviations from a balanced design can be more extreme than what we have explored, which was the case with the autism data analyzed. This situation with the AGP data is by no means unique, as control families are harder to recruit than case families; it is therefore worth exploring whether LIMEhap can be extended to the setting where only discordant sib-pair families are available, without the need to recruit separate control families [36].

Another limitation, which is not unique to LIMEhap either, is the expected limited power for detecting imprinting effects when imprinting varies among individuals. This could be an issue in our ASD analysis: although the AGP genotypes were obtained mainly from blood DNA samples, a small proportion were obtained from cell lines, buccal samples, and other sources. To the best of our knowledge, there is a dearth of statistical methods for analyzing data exhibiting polymorphic imprinting. Future research is therefore warranted to develop methods with adequate power in such situations, but this is beyond the scope of the current work.