Some pitfalls in application of functional data analysis approach to association studies

One of the most effective methods for gene-based mapping employs functional data analysis, which smoothes data using standard basis functions. The full functional linear model includes a functional representation of genotypes and their effects, while the beta-smooth only model smoothes the genotype effects only. Benefits and limitations of the beta-smooth only model should be studied before using it in practice. Here we analytically compare the full and beta-smooth only models under various scenarios. We show that when the full model employs two sets of basis functions equal in type and number, genotypes smoothing is eliminated from the model and it becomes analytically equivalent to the beta-smooth only model. If the basis functions differ only in type, genotypes smoothing is also eliminated from the full model, but the type of basis functions used for smoothing genotype effects becomes redefined. This leads to misinterpretation of the results and may reduce statistical power. When basis functions differ in number, no analytical comparison of the full and beta-smooth only models is possible. However, we show that the numbers of basis functions set unequal can become equal during the analysis, and the full model becomes disadvantageous.

standard independent mathematical functions should be set. Two basis function systems, B-spline and Fourier, are widely employed in regional association analysis 3,10 .
Although both GVFs and BSF have been introduced into FDA-based association analysis, only BSF is of interest for gene mapping research because statistical hypotheses are stated in terms of betas. To smooth genotype effects (betas), but not the genotypes themselves, a simplified version of the model, i.e., beta-smooth only, was proposed 3,4,10,11 . The statistical properties of the full and beta-smooth only versions of the functional regression models have been estimated under different scenarios using both independent and family-based samples 3,4,10,11 . Under these scenarios, the power of the beta-smooth only model was very close to that of the full model. Figure 1 illustrates this finding for power estimates that we obtained from analysis of the GAW17 family-based data set 12 , when both rare and common genetic variants were used for trait simulation and association analysis. We obtained the same results under scenarios that included only rare variants, different proportions of causal variants and unidirectional effects 10 . The same results were obtained in other studies using independent samples 3,4,11,13 . Moreover, for most genes tested on real data, the p-values calculated under the full and beta-smooth only models were identical (see, for example, Table 2 in ref. 13).
Questions arise: Are the full and beta-smooth only models equivalent and is it necessary to functionally represent genotypes when analyzing the association between a particular trait and multiple genotypes. To address these questions, we define a functional linear mixed model, to test the association using both independent and structured samples, and analytically consider scenarios that differ by the type and number of basis functions used to model GVFs and BSF.

Models
The traditional linear regression model of multiple additive effects for an arbitrarily structured sample of n individuals is expressed as: Here y is an (n × 1) known vector of trait values; X is an (n × c) known matrix of c covariates including a column of 1's for the intercept; α is a (c × 1) unknown vector of fixed regression coefficients measuring the effects of c covariates; G is an (n × m) known matrix of genotypes of m genetic variants in the region, where G ij is coded by the number of minor alleles of the j th genetic variant in the i-th individual; β is an (m × 1) unknown vector of fixed regression coefficients measuring the effects of m genotypes, and so Gβ is the regional genotypic component of the trait; h is an (n × 1) random vector of polygenic effects distributed as σ N R (0; ) g 2 , and ε is an (n × 1) random vector of errors distributed as σ N I (0; ) e 2 , where σ g 2 and σ e 2 are the respective components of total variance σ σ σ = + g e 2 2 2 of the trait. Here R and I are the × n n ( ) relationship and identity matrices, respectively. Model (1) assumes that the phenotypes y follow a multivariate normal distribution with a mean vector α β = + E y X G ( ) and a covariance matrix  We further introduce a functional linear mixed model, which provides functional smoothing of both the genotypes and their effects on the trait: 1) unknown vector of continuous genetic variant functions (GVFs), and β  t ( ) denotes an unknown continuous beta smoothing function (BSF) under t. In actual data, t 1 ,… ,t m are the ordered physical positions of genetic variants in the region [t 1 , t m ]. We scale [t 1 , t m ] to [0, 1] and let t be a real number in [0, 1] that defines the position of a particular genetic variant in the scaled region. By applying FDA, GVFs and BSF can be described by sets of K G and K β basis functions, respectively. Then, according to 10 vector of basis functions, which were used to smooth the genotypes; vector of basis functions, which were used to smooth the genotype effects; and, finally, Substituting the expressions for  G t ( ) and β  t ( ) from (3) to equation (2) yields The (m × K β ) smoother-matrix W is constructed from two sets of basis functions, φ(t) and ψ(t), intended for smoothing genotypes and their effects, respectively. Matrix W depends on the type and number of the given basis functions, as well as on the positions of genetic variants in the region. Models (1) and (4) differ by regional genotypic components: Gβ versus GWβ F . Moreover, the parameters associated with genotype effects appear as vector β F of size (K β × 1) in model (4) and as vector β of size (m × 1) in model (1).
A simplified functional linear model, which does not smooth genotypes, can be constructed by discretization of the full functional linear model (4) (Chapter 15 in ref. 5). In this case, only beta smoothing function β  t ( ) is used with set ψ(t) of β K basis functions: In model (6), Ψ is an (m × K β ) smoother-matrix constructed similarly to the (m × K G ) matrix Φ from model (4), i.e., Ψ ij = ψ j (t i ); so the matrix Ψ depends on the set of basis functions and the positions of the genetic variants. Model (6) is called beta-smooth only 3,4,11,13 .

Test statistics
To test for an association between the genomic region and the trait, we test null hypothesis H 0 : β = 0 F against alternative hypothesis H 1 : β ≠ 0 F with test statistics using the residual sums of squares (RSS). For example, this test statistics could be 5,14 are the sums of the squares of residuals under H 0 and H 1 , respectively; Ω, σ 2 and α are estimated under H 0 and P is a projection matrix given as for models (4) and (6), respectively. (4) and (6) indicates that regional genotypic components of trait y under the models have similar forms: either GWβ F or GΨβ F . In these expressions, G denotes the matrix of real genotypes, β F is a vector of K β estimated regression coefficients, and W and Ψ are the (m × K β ) smoother-matrices. These transforming matrices allow the genotypes of m variants with K β regression coefficients to be combined to calculate the regional genotypic component of the trait. The parameters of interest in both models are the regression coefficients β F . When the numbers, K β , of the coefficients are equal in two models, the degrees of freedom of statistical tests in these models are equal, too. For the F-test, df 1 = K β and df 2 = n − K β − 1, and the score test is approximated by the χ 2 distribution with df = K β . Models (4) and (6) differ in smoother-matrix construction. Smoother-matrix W in model (4) is defined by two sets of basis functions, namely, φ(t) and ψ(t). By contrast, smoother-matrix Ψ in model (6) uses only one basis function set, ψ(t). Formally, we cannot trace whether the genotypes, or their effects, or both are smoothed in each model, because any smoother-matrix is simply a transforming matrix constructed using the set(s) of basis functions and the positions of the genetic variants. Therefore, the biological meaning attributed to the smoothing process is lost when the models are formally described by expressions (4) and (6).

Comparison of the different models. A comparison of expressions
The regression coefficients (beta-parameters) of models (4) and (6) are estimated using the maximum likelihood approach as: respectively. With these estimates, it is easy to calculate the regional genotypic component of trait y defined by models (4) and (6) as: To compare models (4) and (6), we present W in (5) as the product of three matrices, W 1 , W 2 , and W 3 : , respectively. Unlike matrices W 1 and W 2 , matrix W 3 is independent of the actual data, and is defined only by sets of basis functions φ(t) and ψ(t). Expression (7) describing the regional genotypic component of trait y in model (4) can be rewritten in terms of matrices W 1 , W 2 and W 2 as: T  T T  T  T  T T  1 2 3  3  2  1  1  1 2 3  1  3  2  1  1 Note that matrix W 2 is always invertible, while the invertibility of matrix W 3 depends on how K G and K β relate to each other. Here, only two situations are possible: K G = K β and K G > K β , because the number of basis functions for GVFs should not be less than that for BSF 10 . (8)). Therefore, matrices W 2 and W 3 can be canceled in expression (9). Therefore, GWβ F is expressed only in terms of W 1 , G, and Ω, i.e., using (8) in terms of Φ, G, and Ω: Hence, when K G = K β , model (4) is defined by only one set of basis functions φ(t), and does not use the second set, ψ(t). In this situation model (4) simplifies to model (6), where Ψ ij = φ j (t i ).
In particular, if model (4) employs two sets of basis functions identical in type and number, then ψ(t) = φ(t) and model (4) just reduces to its simplified version (6). In mathematical terms, the model with "double" smoothing becomes equivalent to that with "single" smoothing. In terms of biological meaning, the model with both genotypes and betas smoothed becomes equivalent to that with beta smoothing only. A comparison of expressions (4) and (6) demonstrates that the equivalence of models (4) and (6) is explained by the equivalence of regional genotypic components of the trait. Although the BSFs under models (4) and (6) may be different, as is shown in ref. 11, Fig. 1, the regional genotypic components remain identical. A majority of published studies that compare the statistical power of the full and beta-smooth only models use two sets of basis functions equal in type and number. In light of our results, it becomes clear that the similarity of power estimates for two models is explained by their analytical equivalence rather than by numerical similarity.
If model (4) employs two sets of basis functions that differ only in their type then ψ(t) ≠ φ(t) and model (4) reduces to model (6), in which the type of basis functions is φ(t) rather than ψ(t). In terms of biological meaning, the set of basis functions selected for genotype smoothing in the full model switches to beta smoothing. In this case, the full model may be misleading and/or underpowered. For example, betas are expected to be smoothed by the Fourier basis if the B-spline and the Fourier bases are set for GVFs and BSF, respectively. The results of analysis are in fact equivalent to those obtained from beta-smooth only model with the B-spline basis employed (Fig. 2a). In this situation, the researcher is misled about the type of basis functions used for beta smoothing. If the researcher had initially used the beta-smooth only model with the Fourier basis (instead of the full model), the statistical power of analysis would have been increased (Fig. 2b).
In any case, the full model is unjustified at K G = K β , for the same (or better) association analysis results can more easily be obtained using beta-smooth only model (6). K G > K β situation. Only when K G > K β , matrix W 3 is not invertible (because this matrix is not square) and cannot be canceled in expression (9). As a result, the statistics under models (4) and (6) are different. Although the degree of freedom remains the same for both models, their smoothing strengths may differ. It is not a priori clear which model will have a greater smoothing strength. Moreover, increase in smoothing strength does not always lead to increase in power. As a result, it is impossible to predict when the full model is more powerful than the beta-smooth only model. Figure 3 illustrates that power can change unpredictably when using the full model.
If the researcher still decides to use the full model with two sets of basis functions, the number of basis functions for GVFs and BSF should be controlled to ensure that K G > K β . Otherwise, the full model may become disadvantageous, as we demonstrated previously. However, the number of basis functions defined by the researcher may be changed during analysis. Such situations occur due to trivial restrictions of FDA methods. In particular, the number of genetic variants in the region should not be less than the number of basis functions for GVFs, and the number of basis functions for GVFs should not be less than that for BSF, that is, m ≥ K G ≥ K β 10 . The available software packages for FDA-based association analysis reduce the number of basis functions for BSF to that for GVFs; that is, K β becomes equal to K G when the condition K G ≥ K β is not satisfied. When the genotype matrix includes linear-dependent genotype columns, the number of genetic variants analyzed in the region is reduced to ensure that the matrices are invertible. If the number of genetic variants is reduced to less than the declared K G value, then this value automatically decreases to m. In this case, K β may become equal to K G . The K G value is difficult to control; therefore, the predictability of the behavior of the model with both GVFs and BSF decreases even in those rare cases, when certain advantages can be expected. On the other hand, smoothing strength can easily be regulated without GVFs by adjusting the K β value in the beta-smooth only model.
In all cases, the full model increases the running time 10 , is less predictable and is more difficult to interpret when K G > K β .

Conclusions
We have demonstrated that there is no reason to use the full models that utilize equal sets of basis functions, because the same results can easier be obtained using beta-smooth only models. As far as the full models that use different sets of basis functions are concerned, we have identified several situations, in which genotype smoothing is counterproductive in that it may cause an unpredictable behavior of the model and reduce statistical power.  (6) using B-spline (a) or Fourier (b) basis for BSF. K G = K β = 25. For two models compared in panel (b), powers estimated as a proportion of P ≤ 2.5 × 10 −6 were 0.861 and 0.876 for the full and beta-smooth only models, respectively. The solid line indicates a one-to-one correspondence; the dotted line is the linear regression line. The same data as in Fig. 1 were used.
Moreover, we have identified several situations, in which unequal numbers of basis functions defined by the researcher become equal during the analysis, and the full model becomes equivalent to the beta-smooth only model. Thus the full model offers only illusory benefits in practice. It has hidden pitfalls that should be taken into consideration in planning functional association analyses.