Introduction

In genomic data association analysis, putative functional links can be identified by relating the measurements from a single genomic data type to other data sources such as observed phenotypes, gene or protein expressions, molecular markers, functional classifications or cellular responses (for example, Risch and Merikangas, 1996; Jansen and Nap, 2001; Jansen et al., 2002; Bhattacharjee et al., 2008). Quantitative and qualitative traits are commonly studied by using the methods developed for phenotype–marker association analysis. These same methods can also be used to find regulatory pathways or patterns controlling gene expressions (eQTLs) and protein expressions (pQTLs) by treating the expression level of the gene or protein as a classical quantitative phenotype (Jansen and Nap, 2001; Bystrykh et al., 2005; Foss et al., 2007). Other possible links can be identified by using phenotype–expression and phenotype–protein association analysis as well as simultaneous association analysis of multiple data types (Hoti and Sillanpää, 2006; Bhattacharjee et al., 2008; Sillanpää and Noykova, 2008; Bhattacharjee and Sillanpää, 2009). Complementary evidence provided by different data sources can be joined together afterwards to study supported overlapping genomic regions (Aune et al., 2004; Bhattacharjee et al., 2008).

Genome-wide (population-based) association analysis is generally considered to be a main tool to infer causative links between genomic marker data and phenotype (Risch and Merikangas, 1996; McCarthy and Hirschhorn, 2008). This happens regardless of problems such as genetic heterogeneity (Terwilliger and Weiss, 1998; Sillanpää and Bhattacharjee, 2006), winner's curse (Lande and Thompson, 1990; Beavis, 1998; Göring et al., 2001; Xiao and Boehnke, 2009) and missing heritability (Maher, 2008; McCarthy and Hirschhorn, 2008; Slatkin, 2009). A particular property of marker data is the systematic spatial dependence along the chromosome. In gene- and protein-expression data, the spatial dependence of expression values at neighbouring genes (or mass-to-charge ratios) is not well established. However, it is possible that the normalization method used will induce some spatial dependence in the expression data. It is also common to assume the presence of some other form of dependence in the data, namely that the expressions of genes belonging to the same pathway are highly dependent on each other. An exception is provided by protein antibody microarrays (used in studies of cancer immune responses), in which the presence of spatial dependence is evident (Wu et al., 2009). It is worth keeping in mind the differences between data types, as they affect the applicability of the methods reviewed in this study.

In a population-based phenotype–marker association study, we hope that study individuals are distantly related at the small genome regions (containing the trait loci) so that there is systematic linkage disequilibrium generated by rare occurrence of recombination events within these regions during a large number of meioses in the ancestral pedigree. At the same time, individuals in the study sample are assumed to be mutually independent (unrelated or equally related to each other). However, this assumption does not hold for individuals showing more complex relationship structures, and the potential existence of such discrepancy in the data is generally known as a cryptic relatedness problem (for example, Devlin and Roeder, 1999; Voight and Pritchard, 2005; Zhang and Deng, 2010). In cryptic relatedness, relationships between individuals in the sample are typically either completely known (for example, pedigrees or families are available) or unknown (for example, a sample of potentially related individuals). The existence of individuals originating from multiple source populations in the study sample may create another problem known as population stratification (for example, Lander and Schork, 1994; Cardon and Palmer, 2003). In addition, in this case, the population structure may be either known (for example, a sample of very distinct populations) or unknown (for example, a random sample of individuals from a single or multiple sites).

It is well known that population-based genomic data association analyses generally suffer from confounding because of population stratification (inability to divide the variance into within- and among-population components) and cryptic relatedness (inability to account for varying within-population relationships among study individuals). If not properly accounted for, spurious associations may occur in the genomic data association analyses because of these confounding factors (stratification or cryptic relatedness) rather than real association between the tested genomic factor and the trait value. Population stratification is a more widely discussed topic than cryptic relatedness. Yet, several papers on both topics can be easily found in phenotype–marker association studies (Lander and Schork, 1994; Cardon and Palmer, 2003; Yu et al., 2006; Kang et al., 2008), and also in phenotype–expression association studies (Gibson, 2003; Kraft and Horvath, 2003; Kraft et al., 2003; Lu et al., 2004) and clinical quantitative trait locus studies in which phenotype is simultaneously explained by multiple data types (Hoti and Sillanpää, 2006; Pikkuhookana and Sillanpää, 2009). It may be noted that in some cases, for example, in model organisms, population stratification and cryptic relatedness may both need to be corrected simultaneously (Yu et al., 2006; Kang et al., 2008; Stich et al., 2008).

In the first section, we will cover the most common approaches to controlling population stratification. In the second section, we will consider approaches for cryptic relatedness. Finally, we consider the use of estimation-based variable selection and multilocus association models and their robustness to these confounding factors. Throughout, the methods are presented for determining phenotype–marker association, but the suitability of each approach for other types of genomic data association analyses is also commented on.

Approaches for population stratification

In principle, it is possible to minimize the risk of population stratification by carefully selecting the study material from a genetically isolated population, or by using stringent ethnic origin criteria. Otherwise, in population-based association studies, the current techniques to overcome the problem of hidden population structure (stratification) can roughly be divided into the following categories: (1) stratified analysis, (2) genomic controls, (3) structured association, (4) smoothing, (5) principal component approach, (6) matching, (7) approaches based on relationship information and, finally, (8) use of secondary samples. Even though most of these approaches have been considered only together with genetic marker data, they are arguably applicable, with small changes, to other data types as well (that is, to the use of gene and protein expressions as explanatory variables). In the following, we briefly present the underlying ideas behind these approaches.

Stratified analysis

If groups of individuals are known or have been observed before the analysis, it is possible to study within-group genomic associations, which are robust to population stratification (Clayton, 2007). Completely separate within-group (within strata) analyses will decrease the overall statistical power because of small sample sizes, but it is also possible to combine within-group information in the test statistic or joint likelihood. The family-based association test methods rely on this principle by using the family as the unit for a known group (Lange et al., 2002; Horvath et al., 2004).
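As an illustration of combining within-group information in a single test statistic, the following minimal sketch contrasts a naively pooled chi-square test with the Cochran-Mantel-Haenszel statistic for a binary trait and allele carrier status. The choice of this particular statistic, the function names and the counts are assumptions made for illustration only; they are not taken from the cited papers.

```python
def pooled_chi2(tables):
    """Pearson chi-square (1 df) after naively pooling all strata."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    d = sum(t[3] for t in tables)
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def cmh_chi2(tables):
    """Cochran-Mantel-Haenszel statistic (1 df, no continuity correction)."""
    deviation = 0.0
    variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        row1, row2 = a + b, c + d      # cases, controls
        col1, col2 = a + c, b + d      # carriers, non-carriers
        deviation += a - row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return deviation ** 2 / variance


# Hypothetical counts per stratum:
# (cases with allele, cases without, controls with, controls without).
# Within each stratum the carrier frequency is identical in cases and controls.
strata = [(40, 10, 8, 2), (2, 8, 10, 40)]
pooled = pooled_chi2(strata)      # inflated by stratification
stratified = cmh_chi2(strata)     # within-stratum evidence only
```

In this constructed example the pooled statistic is large although neither stratum shows any association, whereas the stratified statistic is zero.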

Unfortunately, within-group membership information or families are not always easy to collect. One can then try to approximate membership information (construct approximate ‘families’) from other available information, for example, self-identified ethnicity in human data (see Tang et al., 2005) or the known location where individuals’ grandparents lived, or one can estimate the most likely ancestry (population assignment) or pairwise relatedness from an independent set of molecular markers (for example, Lynch and Ritland, 1999; Pritchard et al., 2000a; Weir et al., 2006). For within-population stratified analyses, see the section ‘Structured association’. It is, however, known that population-based association studies even with related individuals are statistically more powerful than family-based (within family) association studies (Teng and Risch, 1999; Havill et al., 2005; Aulchenko et al., 2007a; Hernández-Sánchez et al., 2003). Thus, ways of correcting for population stratification other than stratified analysis may be preferable.

Genomic controls

In the genomic control approach, one modifies (adjusts) the threshold P-value on the basis of a neutral set of independent markers, that is, null markers (a genetic marker panel), which provide information on the adequate adjustment factor (Devlin and Roeder, 1999; Bacanu et al., 2002; Zheng et al., 2006). The adjustment factor (λ) describes variance inflation, that is, reduction in effective sample size (≈N/λ), where N is the original sample size (see Hinds et al., 2004). The underlying assumption for this method is that all the spurious association signals are smaller than the real signals, so that it is possible to handle the problem by adequately adjusting the threshold value. It is assumed that this external set of markers does not include any trait-associated loci. The benefit of this approach is that one does not need to make any assumptions on the number of subpopulations. Different variants of the genomic control approach have been considered and compared by Dadd et al. (2009). An improved version of this approach was presented by Wang (2009), in which, instead of using a single adjustment factor, one can use several of them. However, it has been argued that the genomic control approach suffers from weak statistical power when the effect of population structure is large, as is common in model organisms (Yu et al., 2006; Kang et al., 2008). In principle, it should be possible to construct an analogous genomic control adjustment factor based on neutral gene expression data, which can then be applied for testing phenotype–expression association.
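The adjustment can be sketched as follows, assuming one-degree-of-freedom chi-square statistics and the widely used median-based estimator of λ; dividing the test statistics by λ is equivalent to raising the significance threshold by the same factor. The function names and the numerical values are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(null_stats):
    """Estimate lambda from 1-df chi-square statistics at null markers."""
    lam = median(null_stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)  # common convention: never deflate below 1


def genomic_control(stats, lam):
    """Divide candidate test statistics by the inflation factor."""
    return [s / lam for s in stats]


# Hypothetical statistics observed at five null markers.
null_stats = [0.10, 0.45, 0.91, 1.80, 3.20]
lam = inflation_factor(null_stats)          # about 2
adjusted = genomic_control([8.0], lam)      # candidate statistic, about 4 after adjustment
effective_n = 1000 / lam                    # effective sample size N / lambda for N = 1000
```

In practice a much larger null marker panel would be needed for a stable estimate of λ; five markers are used here only to keep the example readable.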

Structured association

In structured association, unknown population membership probabilities are first estimated using population assignment methods (for example, Pritchard et al., 2000a; Dawson and Belkhir, 2001; Corander et al., 2003; Falush et al., 2003; Alexander et al., 2009). These probabilities are subsequently used in association analysis (Pritchard et al., 2000b; Thornsberry et al., 2001; Yu et al., 2006). Phenotype–marker association at a candidate locus can be tested within subpopulations using a likelihood ratio test (Pritchard et al., 2000b; Thornsberry et al., 2001). Alternatively, the association model can include the subpopulation mean terms weighted with individual membership probabilities (Yu et al., 2006). Versions of the method also exist in which both of these tasks (estimation of population memberships and phenotype–marker association) are carried out simultaneously (for example, Satten et al., 2001; Ripatti et al., 2001; Sillanpää et al., 2001; Hoggart et al., 2003). In all of these structured association methods, population memberships are estimated on the basis of a neutral set of independent markers (genetic marker panel). For an exception to this, see Sillanpää and Bhattacharjee (2006). It may be noted that the structured association approach has been recently extended to genome-wide sets of marker loci (Alexander et al., 2009). Generally, it is easy to include the subpopulation mean terms also in other types of genomic (that is, phenotype–expression or phenotype–protein) association models or in models that consider marker and expression data jointly to explain the phenotype.
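The variant with membership-weighted subpopulation means can be written, in a simplified form and with notation that is assumed here rather than taken from the cited papers, as

```latex
y_i = \sum_{k=1}^{K} q_{ik}\,\mu_k + \beta\,x_i + e_i,
\qquad \sum_{k=1}^{K} q_{ik} = 1,
\qquad e_i \sim N(0, \sigma^2),
```

where $y_i$ is the phenotype of individual $i$, $q_{ik}$ is the estimated probability that individual $i$ belongs to subpopulation $k$, $\mu_k$ is the mean of subpopulation $k$, $x_i$ is the genotype code at the candidate locus and $\beta$ is its effect. Testing $\beta = 0$ then assesses association after the subpopulation means have been accounted for; replacing $x_i$ by an expression measurement gives the corresponding phenotype–expression model.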

Smoothing

The idea in the smoothing approach is that the signal is smoothed along the chromosome according to an exponential decay or some other spatial function, depending on the genetic or physical map distances (Conti and Witte, 2003; Sillanpää and Bhattacharjee, 2005; Tsai et al., 2008). This is feasible for a tightly linked set of markers that are in considerable linkage disequilibrium with each other. In such a setting, neighbouring markers are used to strengthen the weak but real association signals and smooth the spurious association signals downwards. A similar control approach for spurious peaks has also been proposed for proteomics data sets, in which protein intensity peaks are smoothed with respect to the neighbouring locations (Du et al., 2006) and according to the m/z distance between the positions (Bhattacharjee et al., 2008). For a related smoothing approach for expression data, see Sillanpää and Noykova (2008).
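The basic idea can be sketched with a generic exponentially weighted average of single-marker signals along the map. This is an illustrative kernel smoother under assumed names and values, not the specific model of any of the cited papers.

```python
import math


def smooth_signal(positions, signals, decay_length):
    """Weighted average of signals, with weights decaying exponentially in map distance."""
    smoothed = []
    for pos_j in positions:
        weights = [math.exp(-abs(pos_j - pos_k) / decay_length) for pos_k in positions]
        total = sum(weights)
        smoothed.append(sum(w * s for w, s in zip(weights, signals)) / total)
    return smoothed


# Hypothetical map positions (cM) and single-marker association signals;
# the isolated spike at the third marker is not supported by its neighbours.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
signals = [0.5, 0.4, 6.0, 0.6, 0.5]
smoothed = smooth_signal(positions, signals, decay_length=1.0)
```

The isolated spike is pulled downwards because its tightly linked neighbours carry no signal, whereas a run of moderately elevated neighbouring signals would be preserved. The decay length controls how far along the chromosome information is borrowed.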

Principal component approach

The use of principal component analysis to correct the stratification in structured populations has been suggested (Patterson et al., 2006; Price et al., 2006; Zhu et al., 2008). One proceeds by assuming that an external set of neutral molecular markers is available and that none of them is associated with the trait of interest. Principal components are first estimated from the correlation matrix of external marker genotypes of unrelated individuals (Price et al., 2006). Then, the first few principal components (explaining most of the underlying variation) are used as regression covariates in the association model, or are incorporated into a randomization test (see Kimmel et al., 2007). At minor levels of stratification, one can simply omit samples that appear as outliers in principal component approaches. In the related method of Epstein et al. (2007), instead of principal components, one uses components from a partial least-squares regression. Owing to their ease of implementation, these methods are very popular in human genetics, even if the correction given by them may fail, especially if the external marker set used is not large (for example, Epstein et al., 2007; Lee et al., 2008; Wang, 2009). It should be relatively easy to apply these corrections (estimated either from external marker or from expression data) to phenotype–expression and phenotype–protein association studies.
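To make the first step concrete, the sketch below extracts the leading principal component of a small genotype matrix by power iteration on the individual-by-individual similarity matrix built from standardized markers. The genotype data, the function names and the use of power iteration (instead of a full eigendecomposition) are assumptions made to keep the example self-contained.

```python
import math


def standardize_columns(genotypes):
    """Centre each marker and scale it to unit variance (markers must be polymorphic)."""
    n = len(genotypes)
    m = len(genotypes[0])
    z = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        sd = math.sqrt(sum((g - mean) ** 2 for g in col) / n)
        for i in range(n):
            z[i][j] = (genotypes[i][j] - mean) / sd
    return z


def first_principal_component(genotypes, iterations=200):
    """Leading eigenvector of Z Z^T / m, one score per individual."""
    z = standardize_columns(genotypes)
    n, m = len(z), len(z[0])
    k = [[sum(z[a][j] * z[b][j] for j in range(m)) / m for b in range(n)]
         for a in range(n)]
    v = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        w = [sum(k[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical null-marker genotypes (0/1/2 allele counts): the first three
# individuals come from one subpopulation, the last three from another.
genotypes = [
    [2, 2, 0, 1], [2, 1, 0, 0], [1, 2, 0, 1],
    [0, 0, 2, 1], [0, 1, 2, 2], [1, 0, 1, 1],
]
pc1 = first_principal_component(genotypes)
```

The resulting scores separate the two groups on average and would then enter the association model as a regression covariate alongside the tested marker.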

Matching

During the design stage of the study, it is possible to collect pairs of individuals (full sibs or cousins; one case and one control individual) that are otherwise similar (that is, individually matched) with respect to covariates such as ethnic background, sex, age and so on (Gauderman et al., 1999; Zondervan et al., 2002). Group matching refers to a similar process in which instead of individuals, groups are matched. Even if matching could eliminate the problem of population stratification, one potential problem is over-matching (that is, reduction of statistical power by unnecessary matching on too many factors, which creates matched units that are too much alike also in their phenotypes). To consider matching in quantitative traits, phenotypically discordant sib-pair collection strategies may be used (cf. Risch and Zhang, 1995).

Special procedures for genetic ancestry matching that are applicable to existing data sets have been proposed lately on the basis of the information on non-genetic variables (Lee, 2004) and marker genotypes (Hinds et al., 2004; Luca et al., 2008; Guan et al., 2009). In Hinds et al. (2004), ancestry was estimated by population assignment methods, and in Luca et al. (2008) by principal components. These methods considered different strategies for genetic ancestry group matching and removing ‘unmatchable outlier individuals’. Guan et al. (2009) proposed the use of identity-by-state-based simple (dis)similarity measures as a tool for individual genetic ancestry matching. Luca et al. (2008) emphasized the use of ‘control databases’ in genome-wide association studies and considered a problem wherein cases are sampled in a quite different region from that of the controls.
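A simple identity-by-state similarity and its use for individual matching can be sketched as below. The measure shown (mean proportion of alleles shared) is a generic one, and it is an assumption that it resembles the measures used in the cited work; the function names and genotypes are hypothetical.

```python
def ibs_similarity(geno_a, geno_b):
    """Mean proportion of alleles shared identical by state (between 0 and 1)."""
    # With genotypes coded as 0/1/2 allele counts, two individuals share
    # 2 - |a - b| alleles at a biallelic marker.
    shared = [2 - abs(a - b) for a, b in zip(geno_a, geno_b)]
    return sum(shared) / (2 * len(shared))


def nearest_control(case, controls):
    """Index of the control that is most similar to the case."""
    return max(range(len(controls)), key=lambda i: ibs_similarity(case, controls[i]))


case = [0, 1, 2, 2]
controls = [[2, 1, 0, 0], [0, 1, 2, 1]]
best = nearest_control(case, controls)
```

A practical matching procedure would additionally need a rule for discarding cases without a sufficiently similar control, in the spirit of the ‘unmatchable outlier individuals’ mentioned above.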

Generalization and application of these techniques to other forms of genomic association analysis should be relatively easy.

Use of relationship information

Affected trios

The transmission/disequilibrium test (TDT) makes it possible to control for confounding by studying association in the presence of linkage (Spielman et al., 1993). Originally TDT was developed as a confirmatory second test to filter real associations out of spurious signals. This original motivation is well justified because TDT has less power than ordinary association testing (Long and Langley, 1999). However, TDT is nowadays commonly used as a general test of association. The basic version of TDT assumes a binary phenotype and biallelic loci, and uses data on affected trios (unrelated cases and their parents). At a given locus, it tests whether a certain allele is transmitted from heterozygous parents to the affected offspring more often than expected under Mendelian segregation, and the observed segregation distortion is taken as evidence for the locus having something to do with the affection status. TDT has been generalized to quantitative traits (Allison, 1997; Rabinowitz, 1997; Abecasis et al., 2000), multiallelic markers (Sham and Curtis, 1995), marker haplotypes (Clayton, 1999) and to several data structures (Spielman and Ewens, 1998; Abecasis et al., 2000). As handling of nonrandom missing genotype data in parents is problematic, a robust version of TDT has also been developed (Sebastiani et al., 2004).
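For the basic biallelic case, the test reduces to a McNemar-type statistic on the transmission counts from heterozygous parents, which can be sketched as follows. The function name and the counts are hypothetical.

```python
def tdt_statistic(transmitted, not_transmitted):
    """TDT chi-square (1 df) from transmission counts of heterozygous parents.

    transmitted:     number of heterozygous parents passing the allele of
                     interest to the affected offspring
    not_transmitted: number of heterozygous parents passing the other allele
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - not_transmitted) ** 2 / total


# 100 heterozygous parents, 60 of whom transmitted the allele of interest.
stat = tdt_statistic(60, 40)
```

Under Mendelian segregation both counts have the same expectation, so the statistic is compared with a chi-square distribution with one degree of freedom. Homozygous parents are uninformative and do not enter the counts.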

Pseudo-control data

Another relationship information-based approach to control for confounding is the so-called pseudo-control approach, in which a sample containing only affected cases (and their parents) is collected and the artificial control sample is derived indirectly on the basis of parental genotypes and haplotypes. At each locus, pseudo-control individuals have genetic material that was not transmitted from the parents to the cases (for example, Falk and Rubinstein, 1987; Terwilliger and Ott, 1992; Lander and Schork, 1994; Gauderman et al., 1999; Greenland, 1999). The benefit of deriving the control sample in this way is that one obtains well-matched controls and avoids spurious associations because of ethnic confounding, that is, closer kinship among the affected samples (Terwilliger and Weiss, 1998). This pseudo-control approach has been generalized also to multilocus association analysis and single-tail sampling with quantitative traits (Sillanpää and Hoti, 2007).
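At a single biallelic locus, the pseudo-control genotype can be derived from the trio genotypes as sketched below, with genotypes coded as counts (0, 1 or 2) of a reference allele. The function names are hypothetical, and the sketch ignores haplotypes and missing data.

```python
def transmissible(parent):
    """Allele counts (0 or 1) that a parent with the given genotype can transmit."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[parent]


def pseudo_control_genotype(father, mother, child):
    """Genotype formed by the two parental alleles not transmitted to the child."""
    consistent = any(f + m == child
                     for f in transmissible(father)
                     for m in transmissible(mother))
    if not consistent:
        raise ValueError("trio genotypes are not Mendelian-consistent")
    return father + mother - child


# Heterozygous father, homozygous mother, homozygous affected child:
# the father's other allele and one maternal reference allele remain.
control = pseudo_control_genotype(1, 2, 2)
```

Because each pseudo-control is built from the same parents as its case, cases and pseudo-controls are matched on ancestry by construction.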

Pedigree data

More general approaches use pedigree data, in which linkage information and association information are combined (George et al., 1999; Lund et al., 2003; Pérez-Enciso, 2003; Meuwissen and Goddard, 2004; Meuwissen and Goddard, 2007; Gasbarra et al., 2009; Hernández-Sánchez et al., 2009), resulting in a strong signal at true positions. Linkage information confirms only the real associations and one obtains weaker signals at the spurious positions, which provides a way to control confounding in association studies.

In pedigree-based linkage analysis, founder individuals are generally assumed to be unrelated. However, association information is also available in pedigrees when founders are related. Thus, combined analysis also tries to model the relationships between pedigree founders. To do so, some assumptions (for example, about effective population size) are often made about the founders and/or the recent history of the population (see, for example, Meuwissen and Goddard, 2001).

Correction methods using relationship information are not easy to generalize to gene- or protein-expression data because these approaches are based on the discrete nature of the marker data and the linkage concept.

Use of secondary samples

The idea behind this approach is that analysis is carried out jointly for two samples of data from the same study population, but with different study designs: one containing a population-based sample of individuals (the association signal from these data suffers from confounding) and the other sample comprising related individuals (the association signal from these data is robust to confounding; see above). As the overall signal is a synthesis of the individual signals of two data sets, it is likely to be relatively robust to confounding (for example, Epstein et al., 2005; Kazeem and Farrall, 2005). In addition, this ‘meta-analysis’ approach improves the statistical power by combining information from multiple data sets. For a review and comparison of different secondary sample approaches, see Glaser and Holmans (2009) and Infante-Rivard et al. (2009). Alternatively, it is possible to analyse these two samples (with association and linkage) separately and study the overlap between the results (Manenti et al., 2009), or carry out association testing conditionally on the linkage results (Cantor et al., 2005).

As the use of secondary samples to correct for population stratification relies on two separate samples, issues relating to the presence of heterogeneity cannot be fully ruled out (see Sillanpää and Auranen, 2004). In addition, as this correction method is based on the use of relationship information and marker linkage, this method is not easy to generalize to gene- or protein-expression data. A variant in which two samples are combined and population stratification is corrected for by using the principal component approach (Zhu et al., 2008) should also be applicable for gene- or protein-expression data.

Approaches for cryptic relatedness

It is good to keep in mind that population stratification and cryptic relatedness are two different problems and that correction methods typically consider only a single problem at a time. Exceptions to this are the approaches described in the stratified analysis, genomic controls and use of relationship information subsections above. Especially useful in this respect may be the approaches in which linkage information and association information are combined. Otherwise, the current techniques to overcome the problem of cryptic relatedness in population-based association analysis can be divided into the following categories: (1) infinite polygenic model, (2) regression covariates, (3) test statistic accounting for relatedness and (4) genomic controls. These techniques can be generalized quite easily to the other genomic data analyses. In the following, we briefly describe the ideas behind these methods.

Infinite polygenic model

The classical approach to correcting for relatedness in the sample is to include a polygenic component in the population-based genomic association analysis model (for example, George and Elston, 1987; Jannink et al., 2001; Lu et al., 2004; Yu et al., 2006; Bradbury et al., 2007; Pikkuhookana and Sillanpää, 2009). This practice is also known as the measured genotype approach. Because of the availability of large high-throughput association data sets, it is more popular to use a recent variant of this approach, called GRAMMAR (Amin et al., 2007; Aulchenko et al., 2007a, 2007b), in which residual dependencies are first precorrected from data and repeated (phenotype–marker) association analyses are carried out for adjusted residuals using rapid methods. Even though this approach was presented for marker data, it is straightforward to use it (or the measured genotype approach) in concert with other types of genomic (for example, phenotype–expression or other) association analysis. Nevertheless, the precorrection approach suffers from model misspecification. It underestimates uncertainty in polygenic effects, and may reduce the statistical power (cf. Martinez et al., 2005). Moreover, estimation of variance components is known to be unstable when sample sizes are small (Misztal, 1996; Burton et al., 1999; Pikkuhookana and Sillanpää, 2009). In any case, the GRAMMAR approach has been shown to outperform many of the competing methods introduced for pedigree-based association analysis (see Aulchenko et al., 2007a). For unknown relationships, a relationship matrix can be first estimated using various methods (for example, Milligan, 2003; Leutenegger et al., 2003; Blouin, 2003; Weir et al., 2006; Frentiu et al., 2008). Interestingly, the use of a simple identity-by-state allele-sharing matrix has recently been found to provide an efficient alternative to more sophisticated methods in correcting for cryptic relatedness-induced confounding (Zhao et al., 2007; Kang et al., 2008). See also van de Casteele et al. (2001) and Bink et al. (2008).
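In a commonly used formulation, with notation assumed here for illustration, the model with a polygenic component and the two-step precorrection can be written as

```latex
\begin{aligned}
\text{Measured genotype model:}\quad
  & y = X\beta + x_m\,\alpha_m + g + e,
  \qquad g \sim N(0, \sigma_g^2 A), \quad e \sim N(0, \sigma_e^2 I), \\
\text{Precorrection, step 1:}\quad
  & y = X\beta + g + e
  \;\;\Rightarrow\;\; \hat{r} = y - X\hat{\beta} - \hat{g}, \\
\text{Precorrection, step 2:}\quad
  & \hat{r} = \mu + x_m\,\alpha_m + \varepsilon
  \qquad \text{for each marker } m,
\end{aligned}
```

where $y$ is the phenotype vector, $X\beta$ collects the covariate effects, $x_m$ is the genotype vector at marker $m$ with effect $\alpha_m$, $g$ is the polygenic effect and $A$ is the relationship matrix (known from the pedigree or estimated from markers). In the measured genotype model the variance components are re-estimated for every marker, whereas in the precorrection scheme they are estimated only once in step 1; this is the source of both its speed and the underestimated uncertainty noted above.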

Regression covariate approach

An alternative approach to correcting for relatedness is to use the method of Bonney (Bonney, 1986; Thomas, 2004). This method approximates the influence of the infinite polygenic model by having phenotypes of the parents, the spouse and sibs of the participant as regression covariates in the model (Bonney, 1986). Pikkuhookana and Sillanpää (2009) compared the performance of these two approaches (infinite polygenic model and regression covariates) in Bayesian genomic association models and found the regression covariate approach to perform better for smaller genomic data sets. It may be noted that their model considered the effects of marker genotypes and gene expressions jointly in explaining the phenotype. Although this approach provides a framework for including phenotypes from ungenotyped parents in the analysis (cf. Purcell et al., 2005), it cannot be applied in the case of unknown relatedness, as the phenotypes of the relatives are generally not available.

Test statistic accounting for relatedness

The test statistics often used for population-based association analysis assume the independence of individuals in the sample. Test statistics for population-based phenotype–marker association testing among related individuals (correlated family data) have been introduced, for example, for (between-family) association in family-based design (Teng and Risch, 1999) or for a more general design using all the related and unrelated individuals (Slager and Schaid, 2001). A similar kind of test based on family-based (within-family) association analysis has been developed for expression data (Kraft et al., 2003). In case of unknown relationships, one can start by estimating the family structure (or pairwise relatedness) using an additional set of molecular markers (for example, Gasbarra et al., 2007; Bink et al., 2008). However, one should be careful here whether the test statistic is measuring population-based or within-family association. Unlike within-family association, population-based association analysis with related individuals suffers from population stratification, but has more statistical power than family-based (within family) association analysis, which was considered in the section ‘Stratified analysis’ (Teng and Risch, 1999; Slager and Schaid, 2001). With modifications, it is possible to derive suitable test statistics accounting for relatedness in population-based association for gene- or protein-expression data.

Genomic controls

Even though genomic control was introduced to control for population stratification, it has also been suggested for handling cryptic relatedness (Devlin and Roeder, 1999; Bacanu et al., 2002; Yan et al., 2009). In this method, variance inflation is corrected by adjusting the test statistic based on the information from unlinked null markers. This also works in the case of unknown relatedness because, unlike in the infinite polygenic model, one does not need to estimate the pairwise relationships first. The genomic control approach has also been suggested to be useful for additional correction after the infinite polygenic model is fitted (Amin et al., 2007).

Use of estimation-based variable selection and multilocus models

It may sound strange that by using estimation-based variable selection and multilocus models (without a correction term), one can automatically reduce the number of false positives in genomic data association analyses. However, there is increasing evidence that Bayesian or frequentist multilocus modelling approaches are flexible enough to self-correct automatically for population stratification in binary traits (Setakis et al., 2006), ordinal and censored traits (Iwata et al., 2009), as well as in quantitative traits (Iwata et al., 2007). In the case of cryptic relatedness and quantitative traits, the self-correction property of the Bayesian multilocus association approach was found by Pikkuhookana and Sillanpää (2009). The robustness of the multilocus association approach to these problems results presumably from the fact that, during the estimation (variable selection) process, other genetic components, a few at a time, could capture or explain a small amount of confounding variation (see Pikkuhookana and Sillanpää, 2009). This is essentially so because in these approaches, variable selection is done simultaneously with the effect estimation (Kilpikari and Sillanpää, 2003; O’Hara and Sillanpää, 2009) and large candidate panels jointly have the potential to explain many types of variation. For a close connection between the multilocus association model and a polygenic model with realized relationship matrix, see Hayes et al. (2009). However, additional studies on the self-correction property of these approaches are needed before one can say anything definitive on this, for example, on how much variability in the markers is needed for the self-correction approach to be effective. In two-genotype data analysis (in which there is only a single estimable effect coefficient at each locus), Iwata et al. (2007) found that the use of a correction term in the model still provides some additional advantages over self-correction. However, it is likely that using single-nucleotide polymorphism data and fitting two estimable coefficients (for three genotypes) provides more variability and hence a greater capacity for self-correction in the model. As a conclusion, I wish to emphasize that there are good reasons why future studies on genomic data association analysis should focus on, or at least pay much more attention to, better characterizing the benefits and pitfalls of these estimation-based multilocus approaches. It is also well known that the use of multilocus models improves statistical power and helps avoid problems due to model misspecification, such as biased position estimates and the occurrence of ‘ghost QTLs’ (see, for example, Sillanpää and Auranen, 2004).