Genotype-based association analysis via entropy

Li, Yu-Mei; Xiang, Yang

doi:10.1038/jhg.2012.102

Original Article
Published: 23 August 2012

Journal of Human Genetics volume 57, pages 734–737 (2012)

Abstract

With the advent of large-scale genotyping technologies, the enormous quantities of genotype data generated have been exploited through phased haplotypes, and the haplotype-based association study has become one of the major statistical methods for gene mapping of human complex traits. However, the haplotype-based method depends on haplotype frequencies, whose estimation becomes computationally infeasible when haplotypes are not directly observed. This paper provides a genotype-based statistic for association analysis with multiple tightly linked markers using entropy theory. The statistic requires only genotype data and no haplotype phasing. The distribution and the power of the statistic are investigated by simulation. The results show that the type I error rate of the statistic is close to the nominal level and that its power is competitive with that of a haplotype-based entropy statistic. We demonstrate the use of the statistic by applying it to a data set on hereditary hemochromatosis.

Introduction

Linkage disequilibrium (LD), the non-random co-occurrence of alleles from different loci, has a fundamental role in genetic studies as a tool for gene mapping of human complex traits. However, the level of LD is often influenced by a number of factors, including genetic linkage, selection, the rate of recombination and mutation, genetic drift, non-random mating, population structure and other non-biological forces. These factors pose many challenges for LD mapping or association analysis in genetic studies. One of those challenges is to develop novel statistical methods that improve the power of gene mapping. Although many LD methods have been developed for mapping complex disease genes, haplotype-based analysis remains one of the major statistical methods because haplotypes of multiple single-nucleotide polymorphisms (SNPs) are considered a more informative format of polymorphisms for genetic analysis than single SNPs.¹

Classical haplotype-based statistics compare haplotype frequencies between affected and unaffected individuals^{1, 2} or compare haplotype similarities between affected and unaffected individuals.^{3, 4} Recently, Zhao et al.⁵ proposed an entropy-based statistic T_PE for genome-wide association studies through a non-linear transformation of haplotype frequencies. Nonetheless, the results of these methods are not uniformly consistent.^{4, 5} An important reason is that the haplotype-based method depends on haplotype frequencies, whose estimation becomes computationally infeasible when haplotypes are not directly observed.

In this paper, we propose an entropy-based statistic, denoted T_GE, as an alternative to the method of Zhao et al.⁵ that requires only genotype data at linked markers. We investigate the performance of the statistic T_GE by computer simulations and apply it to a real data set on hereditary hemochromatosis (HH).

Materials and methods

Composite LD measure

We consider two marker loci ‘1’ and ‘2’, with co-dominant alleles A and a at locus 1 and B and b at locus 2. The frequencies of alleles A, a, B and b are given by P_A, P_a, P_B and P_b, and the frequencies of genotypes AA, Aa, aa, BB, Bb and bb are given by P_AA, P_Aa, P_aa, P_BB, P_Bb and P_bb, respectively. Let P_AB, P_Ab, P_aB and P_ab be the haplotype frequencies of AB, Ab, aB and ab, respectively. Let P_/ denote the joint frequency of alleles at loci 1 and 2 on two different gametes; for example, P_A/B is the joint frequency of alleles A and B on two different gametes. The composite LD coefficient is defined as Δ_AB=P_AB+P_A/B−2P_AP_B, and an estimator of Δ_AB is defined as Δ̂_AB=n_AB/N−2P̂_AP̂_B, where n_AB=2n_AABB+n_AABb+n_AaBB+n_AaBb/2, N is the total number of subjects, n is the number of subjects with the genotype indicated by its subscript, and P̂_A and P̂_B are estimates of allele frequencies. It can be seen that these estimators rely only on genotype information. We define a two-dimensional random variable X=(X₁,X₂)^T as the state of the two alleles A and B, where X₁ and X₂ denote the number of copies of ‘A’ and ‘B’ at markers ‘1’ and ‘2’, respectively. The probability distributions of X₁ and X₂ are described in Table 1. It can be shown that E(X₁)=2P_A and E(X₂)=2P_B, and that the covariance of X₁ and X₂ is equal to twice the composite LD coefficient Δ_AB: cov(X₁, X₂)=2Δ_AB.
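The relation cov(X₁, X₂)=2Δ_AB can be checked numerically. The following is a minimal sketch, assuming the composite LD estimator takes the standard count-based form n_AB/N−2P̂_AP̂_B with n_AB=2n_AABB+n_AABb+n_AaBB+n_AaBb/2; the function names and the toy genotypes are illustrative and not part of the original paper.

```python
from collections import Counter

def composite_ld(genotypes):
    """Composite LD estimate from a list of (x1, x2) pairs, where x1 and x2
    are the numbers of copies (0, 1 or 2) of alleles A and B in one subject."""
    n = len(genotypes)
    counts = Counter(genotypes)
    # n_AB = 2*n_AABB + n_AABb + n_AaBB + n_AaBb/2 (assumed standard form)
    n_ab = (2 * counts[(2, 2)] + counts[(2, 1)]
            + counts[(1, 2)] + 0.5 * counts[(1, 1)])
    p_a = sum(x1 for x1, _ in genotypes) / (2 * n)  # allele frequency of A
    p_b = sum(x2 for _, x2 in genotypes) / (2 * n)  # allele frequency of B
    return n_ab / n - 2 * p_a * p_b

def sample_covariance(genotypes):
    """Sample covariance of X1 and X2 with divisor n (not n - 1)."""
    n = len(genotypes)
    m1 = sum(x1 for x1, _ in genotypes) / n
    m2 = sum(x2 for _, x2 in genotypes) / n
    return sum((x1 - m1) * (x2 - m2) for x1, x2 in genotypes) / n

# Toy genotypes, purely illustrative.
toy = [(2, 2), (2, 1), (1, 1), (0, 0), (1, 2), (0, 1), (1, 0), (2, 2)]
delta_hat = composite_ld(toy)
cov_hat = sample_covariance(toy)
```

With the divisor n in the sample covariance, `cov_hat` equals `2 * delta_hat` up to floating-point error, which mirrors the population identity without any haplotype phasing.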

Table 1 The probability distributions of X₁ and X₂

The entropy-based statistic T_GE

Suppose that there are k markers, each of which has two alleles A and a. The frequency of allele A at the ith marker is denoted by P_i (i=1,…, k). The entropy of the frequency P_i for the ith marker is defined as H_i=−P_i log P_i.⁶ Denote the first partial derivative of the entropy H_i with respect to the frequency P_j as z_ij: z_ij=−1−log P_i for i=j and z_ij=0 for i≠j (i, j=1,…, k). To simplify our presentation, a superscript ‘A’ indicates a measure in affected individuals and a superscript ‘C’ indicates a measure in unaffected individuals. The frequencies of allele A in affected individuals for the k markers are given by the vector P^A=(P^A_1,…, P^A_k)^T, and the frequencies of allele A in unaffected individuals are given by the vector P^C=(P^C_1,…, P^C_k)^T. Let H^A=(H^A_1,…, H^A_k)^T and H^C=(H^C_1,…, H^C_k)^T be the corresponding entropy vectors. Denote X_i as the state of allele A, that is, the number of copies of ‘A’ at marker ‘i’, i=1,…, k, and define the k-dimensional random variable X=(X₁, X₂,…, X_k)^T. Suppose that there are n^A affected individuals and n^C unaffected individuals (n^A+n^C=2N). The total numbers of allele A at marker ‘i’ in the n^A affected individuals and the n^C unaffected individuals are denoted as N_i^A and N_i^C, respectively. Let X^A_l=(X^A_l1,…, X^A_lk)^T and X^C_m=(X^C_m1,…, X^C_mk)^T be the states of allele A for the lth (l=1, 2,…, n^A) affected individual and the mth (m=1, 2,…, n^C) unaffected individual, respectively. It is easy to see that

N_i^A=Σ_{l=1}^{n^A} X^A_li, N_i^C=Σ_{m=1}^{n^C} X^C_mi.

Let P̂^A_i=N_i^A/(2n^A) and P̂^C_i=N_i^C/(2n^C). Note that P̂^A=(P̂^A_1,…, P̂^A_k)^T and P̂^C=(P̂^C_1,…, P̂^C_k)^T are the maximum likelihood estimators of the vectors P^A and P^C, respectively. Let Σ^A=(σ^A_ij) and Σ^C=(σ^C_ij) be the covariance matrices of X in affected and unaffected individuals. Because P̂^A and P̂^C are half the sample means of X, the asymptotic normality of the maximum likelihood estimator gives

√n^A (P̂^A−P^A) → N(0, Σ^A/4), √n^C (P̂^C−P^C) → N(0, Σ^C/4),

where σ^A_ij=2Δ^A_ij and σ^C_ij=2Δ^C_ij for i≠j;⁷ here, σ^A_ii and σ^C_ii are the variances of X_i for marker ‘i’, and Δ^A_ij and Δ^C_ij are the composite LD coefficients for markers ‘i’ and ‘j’ in affected and unaffected individuals, respectively.

Let Z^A=(z^A_ij) and Z^C=(z^C_ij) be the diagonal matrices of partial derivatives evaluated at P^A and P^C. By the delta method, √n^A (Ĥ^A−H^A) → N(0, W^A) and √n^C (Ĥ^C−H^C) → N(0, W^C), where W^A=Z^AΣ^AZ^A/4 and W^C=Z^CΣ^CZ^C/4. Let Ĥ^A, Ĥ^C, Ŵ^A and Ŵ^C be the estimators of H^A, H^C, W^A and W^C, respectively, obtained by substituting the estimated allele frequencies and sample covariances. Then the entropy-based statistic can be defined as

T_GE=(Ĥ^A−Ĥ^C)^T (Ŵ^A/n^A+Ŵ^C/n^C)^{−1} (Ĥ^A−Ĥ^C).

The statistic T_GE asymptotically follows a central χ²_(k) distribution under the null hypothesis of no association and a non-central χ²_(k) distribution⁸ with non-centrality parameter λ=(H^A−H^C)^T (W^A/n^A+W^C/n^C)^{−1} (H^A−H^C) under the alternative hypothesis of association. When the inverse of the matrix Ŵ^A/n^A+Ŵ^C/n^C does not exist, its generalized inverse is used.
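As an illustration of how such a statistic could be computed from genotype data alone, here is a minimal sketch. It assumes that T_GE is the quadratic form of the entropy difference Ĥ^A−Ĥ^C in the inverse of its delta-method covariance, with each group contributing ZΣZ/(4n), where Σ is the sample covariance of the allele counts. The function names are hypothetical, and the sketch solves a linear system rather than using the generalized inverse, so it does not cover the singular case.

```python
import math

def entropy_and_cov(rows):
    """For one group of genotype rows (each entry 0, 1 or 2 copies of allele A),
    return the entropy vector and the estimated covariance matrix of that vector."""
    n, k = len(rows), len(rows[0])
    mean = [sum(r[i] for r in rows) / n for i in range(k)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / n
            for j in range(k)] for i in range(k)]
    p = [m / 2 for m in mean]                   # allele frequencies, assumed in (0, 1]
    h = [-pi * math.log(pi) for pi in p]        # H_i = -P_i log P_i
    z = [-1 - math.log(pi) for pi in p]         # dH_i/dP_i
    # Delta method: Var(H_hat) is approximately Z * Cov(X) * Z / (4 n).
    w = [[z[i] * cov[i][j] * z[j] / (4 * n) for j in range(k)] for i in range(k)]
    return h, w

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes a is non-singular."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * k
    for i in reversed(range(k)):
        x[i] = (m[i][k] - sum(m[i][j] * x[j] for j in range(i + 1, k))) / m[i][i]
    return x

def t_ge(cases, controls):
    """Entropy-based statistic from genotype rows of cases and controls."""
    h_a, w_a = entropy_and_cov(cases)
    h_c, w_c = entropy_and_cov(controls)
    k = len(h_a)
    d = [h_a[i] - h_c[i] for i in range(k)]
    v = [[w_a[i][j] + w_c[i][j] for j in range(k)] for i in range(k)]
    y = solve(v, d)                             # y = V^{-1} d
    return sum(d[i] * y[i] for i in range(k))

# Toy data with two markers, purely illustrative.
cases = [(2, 1), (1, 1), (0, 2), (1, 0), (2, 2), (1, 2)]
controls = [(0, 1), (1, 1), (0, 0), (1, 0), (2, 1), (0, 2)]
statistic = t_ge(cases, controls)
```

The resulting value would be compared with a χ² distribution with k degrees of freedom. Note that the derivative −1−log P vanishes at P=1/e, so a marker with estimated allele frequency close to that value makes the matrix nearly singular.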

Results

Distribution of the statistic T_GE and type I error rate

To assess the distribution of T_GE in finite samples, a simulation study is performed under the null hypothesis of no association. The simulations, run with the computer program SNaP,⁹ are implemented similarly to those in Zhao et al.⁵ We consider two biallelic marker loci that generate four haplotypes (AB, Ab, aB and ab) with frequencies 0.2952, 0.2562, 0.1957 and 0.2529, respectively. We randomly generate 20 000 individuals in the general population and divide them into equal groups of cases and controls. Then, we randomly sample N individuals (here, n^A=n^C=N) each from the cases and controls, with 10 000 replicates, which yields 10 000 values of the statistic T_GE. Figure 1 plots the histogram of the test statistic T_GE when 2N is 200. It can be seen that the distribution of T_GE is similar to the theoretical central χ²_(k) distribution. For a given significance level α (α=0.05), the type I error rate is estimated as the proportion of the 10 000 replicates in which the null hypothesis is rejected when it holds. The estimated type I error rates for sample sizes (2N) from 200 to 1000 individuals are shown in Table 2. The type I error rates are close to the nominal level α=0.05. These results indicate the validity of the statistic T_GE.
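A type I error check of this kind can be sketched as follows. This is not the SNaP-based procedure of the paper: it assumes random mating, draws each genotype directly as a pair of haplotypes with the stated frequencies instead of sampling from a generated population of 20 000, and uses a two-marker closed form of the statistic under the delta-method covariance assumption. All function names are hypothetical. For two markers, the χ² upper tail is P(T>t)=exp(−t/2), so the critical value at level α is −2 log α.

```python
import math
import random

HAPLOTYPES = [(1, 1), (1, 0), (0, 1), (0, 0)]   # AB, Ab, aB, ab as (copies of A, copies of B)
FREQS = [0.2952, 0.2562, 0.1957, 0.2529]        # haplotype frequencies from the text

def sample_genotypes(n, rng):
    """Draw n two-marker genotypes by pairing two independently drawn haplotypes."""
    rows = []
    for _ in range(n):
        h1, h2 = rng.choices(HAPLOTYPES, weights=FREQS, k=2)
        rows.append((h1[0] + h2[0], h1[1] + h2[1]))
    return rows

def group_stats(rows):
    """Entropy pair and the three distinct entries of its estimated covariance."""
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    s11 = sum((r[0] - m1) ** 2 for r in rows) / n
    s22 = sum((r[1] - m2) ** 2 for r in rows) / n
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows) / n
    p1, p2 = m1 / 2, m2 / 2
    h = (-p1 * math.log(p1), -p2 * math.log(p2))
    z1, z2 = -1 - math.log(p1), -1 - math.log(p2)
    w = (z1 * z1 * s11 / (4 * n), z1 * z2 * s12 / (4 * n), z2 * z2 * s22 / (4 * n))
    return h, w

def t_ge_two_markers(cases, controls):
    """Quadratic form d^T V^{-1} d with an explicit 2x2 inverse."""
    (h_a, w_a), (h_c, w_c) = group_stats(cases), group_stats(controls)
    d1, d2 = h_a[0] - h_c[0], h_a[1] - h_c[1]
    a, b, c = w_a[0] + w_c[0], w_a[1] + w_c[1], w_a[2] + w_c[2]
    det = a * c - b * b
    return (c * d1 * d1 - 2 * b * d1 * d2 + a * d2 * d2) / det

def type1_error(n_per_group, replicates, alpha=0.05, seed=0):
    """Proportion of null replicates whose statistic exceeds the chi-square(2) critical value."""
    rng = random.Random(seed)
    critical = -2 * math.log(alpha)
    rejections = 0
    for _ in range(replicates):
        cases = sample_genotypes(n_per_group, rng)
        controls = sample_genotypes(n_per_group, rng)
        if t_ge_two_markers(cases, controls) > critical:
            rejections += 1
    return rejections / replicates

rate = type1_error(n_per_group=100, replicates=1000)
```

Here `n_per_group=100` corresponds to 2N=200 in the text. With 1000 replicates the Monte Carlo standard error of the estimated rate is roughly 0.007, so more replicates are needed for the precision reported in the paper.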

Table 2 Estimated type I error rates of the statistic T_GE for 10 000 simulations

The power of the test statistic T_GE

To evaluate the power of the statistic T_GE and compare it with that of the statistic T_PE, we performed simulations under the alternative hypothesis of association. Consider a biallelic disease locus with alleles D and d and two biallelic marker loci that generate four haplotypes (AB, Ab, aB and ab) with frequencies 0.2952, 0.2562, 0.1957 and 0.2529, respectively. The disease allele frequency is taken as P_D=0.3. The penetrances of genotypes DD, Dd and dd are denoted by f₁₁, f₁₂ and f₂₂, respectively. The overall LD¹⁰ coefficients between the four haplotypes and the allele D at the disease locus are chosen as 0.0728, −0.0448, 0.0192 and −0.0472, respectively. We consider four genetic models: the additive, dominant, recessive and multiplicative models. Under a specific genetic model and for a particular sample size, the power is estimated as the proportion of the 10 000 replicates in which the null hypothesis is rejected when the alternative hypothesis holds. Figure 2 shows the power curves against sample size at the 0.05 significance level for T_GE and T_PE. When the sample size is larger than 100, the statistic T_GE has consistently higher power than the statistic T_PE under the dominant model. The power of T_GE is almost the same as that of T_PE under the multiplicative model and the additive model, except when the sample size is smaller than 200 for the additive model. Under the recessive model, however, the statistic T_PE has higher power than the statistic T_GE when the sample size is smaller than 400.
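To see how the disease model shifts marker haplotype frequencies between cases and controls, the following sketch computes the haplotype frequencies among gametes of affected and unaffected individuals. It assumes random mating and assumes that the overall LD coefficient of a haplotype h with allele D means P(h and D on the same gamete)−P_hP_D. The penetrance values in the usage lines are hypothetical, since the text does not list them, and the function name is illustrative. The sketch gives gamete-level frequencies only, not the joint genotype distribution used in the simulations.

```python
HAP_FREQ = {"AB": 0.2952, "Ab": 0.2562, "aB": 0.1957, "ab": 0.2529}
LD_WITH_D = {"AB": 0.0728, "Ab": -0.0448, "aB": 0.0192, "ab": -0.0472}
P_D = 0.3  # disease allele frequency from the text

def group_haplotype_freqs(f11, f12, f22):
    """Marker haplotype frequencies among gametes carried by affected and
    unaffected individuals, given penetrances of DD, Dd and dd."""
    p_d = 1 - P_D
    # Probability of being affected given that one gamete carries D (or d);
    # the other gamete is an independent draw under random mating.
    risk_D = f11 * P_D + f12 * p_d
    risk_d = f12 * P_D + f22 * p_d
    prevalence = P_D * risk_D + p_d * risk_d
    cases, controls = {}, {}
    for hap, freq in HAP_FREQ.items():
        joint_D = freq * P_D + LD_WITH_D[hap]   # P(hap and D), assumed LD definition
        joint_d = freq - joint_D                # P(hap and d)
        cases[hap] = (joint_D * risk_D + joint_d * risk_d) / prevalence
        controls[hap] = (joint_D * (1 - risk_D)
                         + joint_d * (1 - risk_d)) / (1 - prevalence)
    return cases, controls

# Hypothetical dominant model: DD and Dd share one penetrance.
case_freq, control_freq = group_haplotype_freqs(f11=0.4, f12=0.4, f22=0.1)
```

Under this parametrization, the case frequency of a haplotype exceeds its population frequency exactly when its LD coefficient with D is positive and carriers of D have the higher risk, which is the signal that both T_GE and T_PE try to detect.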

Applications to HH

HH is inherited as a recessive disease that results in excessive iron absorption from the diet and leads to chronic disease and early death. It is one of the most common inherited diseases among people of European descent, with an estimated prevalence of 1/200 to 1/400, and an even higher prevalence is likely in the Irish population.¹¹ About 1 in 8 to 1 in 10 Australians of Northern European ancestry are genetic carriers for HH.¹² In the course of cloning the hemochromatosis gene, 101 HH cases and 64 controls were genotyped at 43 microsatellite repeat markers that span the 6.5-Mb HH gene region¹³ (http://link.springer.de/link/service/journals/00439/index.htm). We analyzed the data using the statistic T_GE. To simplify our computation, four markers are used in our analysis: 2229, 2241, 2242 and 2236. We take the ancestral allele given in Thomas et al.¹³ at each marker as allele A and all the other alleles as allele a. In Table 3, we present the values of the statistic T_GE and the corresponding P-values for testing the association of two, three and four markers with HH. The statistic T_GE yields very small P-values.
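The recoding of multi-allelic microsatellite genotypes into the biallelic A/a form can be sketched as follows. The allele labels and ancestral alleles in the usage lines are made up for illustration and are not the values from Thomas et al.; the function names are hypothetical.

```python
def copies_of_ancestral(genotype, ancestral):
    """Number of copies (0, 1 or 2) of the ancestral allele in one genotype,
    where genotype is a pair of allele labels."""
    return sum(1 for allele in genotype if allele == ancestral)

def recode_individual(genotypes, ancestral_alleles):
    """Recode one individual's multi-allelic genotypes, one per marker, into
    counts of allele A (the ancestral allele); all other alleles count as a."""
    return tuple(copies_of_ancestral(g, a)
                 for g, a in zip(genotypes, ancestral_alleles))

# Hypothetical allele labels for four markers in one individual.
ancestral = ["5", "3", "7", "2"]
individual = [("5", "5"), ("3", "4"), ("1", "6"), ("2", "9")]
recoded = recode_individual(individual, ancestral)
```

Each recoded row has one entry per marker and can be passed directly to a genotype-based statistic that expects 0, 1 or 2 copies of allele A.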

Table 3 Tests of association between hemochromatosis genotype and HH

Discussion

With the availability of high-density maps of SNP markers, population-based LD mapping or association study provides an unprecedented opportunity for identifying genetic variants that influence human complex traits. Haplotype-based analysis is used as one of the major statistical methods for mapping genes of human complex traits. When only genotype data at multiple loci are collected from a sample of unrelated individuals, haplotype-based methods require estimation of haplotype phases and frequencies. Although current computational and laboratory methods promise improved determination of haplotype phase, the haplotype-based method is not yet cost-effective.^{14, 15} Here, we proposed an entropy-based statistic using genotype data for association analysis. The validity and the power of the statistic are demonstrated by simulation analysis and a case study. In contrast with haplotype-based association study, our method does not need haplotype information, and the power of our method is sometimes higher than that of a haplotype-based method (for example, the method of Zhao et al.⁵) when the sample size is not very small. This paper mainly investigated the effect of the mode of inheritance on the power; the effect of LD between SNPs on the power was not examined and remains for future work.

References

1. Akey, J., Jin, L. & Xiong, M. Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur. J. Hum. Genet. 9, 291–300 (2001).
2. Chapman, N. H. & Wijsman, E. M. Genome screens using linkage disequilibrium tests: optimal marker characteristics and feasibility. Am. J. Hum. Genet. 63, 1872–1885 (1998).
3. Bourgain, C., Genin, E., Margaritte-Jeannin, P. & Clerget-Darpoux, F. Maximum identity length contrast: a powerful method for susceptibility gene detection in isolated populations. Genet. Epidemiol. Suppl. 21, S560–S564 (2001).
4. Tzeng, J. Y., Devlin, B., Wasserman, L. & Roeder, K. On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am. J. Hum. Genet. 72, 891–902 (2003).
5. Zhao, J. Y., Boerwinkle, E. & Xiong, M. An entropy-based statistic for genomewide association studies. Am. J. Hum. Genet. 77, 27–40 (2005).
6. Shannon, C. E. A mathematical theory of communication. Bell. Systems Tech. J. 27, 379–423 (1948).
7. Rao, C. R. Linear Statistical Inference and its Applications, John Wiley: New York, (1973).
8. Lehmann, E. L. Theory of Point Estimation, John Wiley & Sons: New York, (1983).
9. Nothnagel, M. Simulation of LD block-structured SNP haplotype data and its use for the analysis of case-control data by supervised learning methods. Am. J. Hum. Genet. 71 (Suppl), A2363 (2002).
10. Xiong, M., Zhao, J. Y. & Boerwinkle, E. Haplotype block linkage disequilibrium mapping. Front. Biosci. 8, 85–93 (2003).
11. Feder, J. N., Gnirke, A., Thomas, W., Tsuchihashi, Z., Ruddy, D. A., Basava, A. A. et al. A novel MHC class I-like gene is mutated in patients with hereditary haemochromatosis. Nat. Genet. 13, 399–408 (1996).
12. Edward, C. Q., Griffen, L. M., Goldgar, D., Drummond, C., Solnick, M. H. & Kushner, J. P. Prevalence of haemochromatosis among 11,065 presumably healthy donors. N Engl. J. Med. 318, 1355–1362 (1988).
13. Thomas, W., Fullan, A., Loeb, D. B., McClelland, E. E., Bacon, B. R. & Wolff, R. K. A haplotype and linkage disequilibrium analysis of the hereditary haemochromatosis gene region. Hum. Genet. 102, 517–525 (1998).
14. Browning, S. R. & Browning, B. L. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703–714 (2011).
15. Ruiz-Marín, M., Matilla-García, M., Cordoba, J. A., Susillo-González, J. L., Romo-Astorga, A., González-Pérez, A. et al. An entropy test for single-locus genetic association analysis. BMC Genet. 23, 11–19 (2010).

Acknowledgements

This study was supported by the Foundation of Hunan Educational Committee (11B095) and National Social Science Fund Youth Project (11CTJ003).

Author information

Authors and Affiliations

Department of Mathematics, Huaihua University, Huaihua, China
Yu-Mei Li & Yang Xiang

Corresponding author

Correspondence to Yu-Mei Li.

About this article

Cite this article

Li, YM., Xiang, Y. Genotype-based association analysis via entropy. J Hum Genet 57, 734–737 (2012). https://doi.org/10.1038/jhg.2012.102

Received: 07 June 2012
Revised: 16 July 2012
Accepted: 20 July 2012
Issue Date: November 2012