Introduction

Genetic and nongenetic (so-called environmental) factors are involved in the insulin resistance (IR) metabolic syndrome. IR and dyslipidemia are associated with increased risk of cardiovascular diseases. Among the various blood variables, plasma low-density lipoprotein cholesterol (LDL-C) is used as a marker of cardiovascular risk. Numerous papers report associations between genetic variants and IR, features of the metabolic syndrome and dyslipidemia. Several recent genome-wide association studies reported many loci that may modulate dyslipidemia.1, 2, 3, 4, 5 In this study, we selected genetic variants distributed on various human chromosomes and previously reported to be associated with either IR or dyslipidemia, and searched for those explaining LDL-C level.

PPARG P12A is associated with IR6 and LDL-C.7 UCP3 −55C>T has been reported to be associated with type 2 diabetes and an atherogenic lipid profile.8 Haplotypes of the promoter (−11391G>A, −11377C>G) and of the exon 2–intron 2 region (+45T>G, +276G>T) of the ADIPOQ gene (also called ACDC, APM1, GBP28), which encodes the adiponectin protein, are associated with IR and adiponectin level,9, 10 and decreased adiponectin levels were reported to contribute to the atherogenic lipid profile.11 TNF −308G>A has been reported to be associated with IR.12 The functional LIPC −514C>T single nucleotide polymorphism (SNP) has been associated with plasma lipid levels.13 CARTPT −3608C>T has been associated with obesity14 and plasma lipid levels.15 PCSK9 E670G was reported to be associated with plasma LDL-C.16 The functional ENPP1 K121Q has been associated with IR and atherogenic phenotypes.17 Moreover, K121Q together with IVS8 +27T>G, IVS20 −11 delT, the +828 13 bp insertion and +1044A>G define haplotypes associated with IR and obesity.18 The SCAP I796V is involved in cholesterol homeostasis.19 Several SNPs (G2S, c.795C>T, c.1119C>T) from SCARB1 are associated with plasma lipid levels and LDL-C.20

Atherogenesis, for which LDL-C level may be considered as a marker, is a complex trait resulting from potentially multiple gene–gene and gene–environment interactions. In dissecting multifactorial traits, conventional statistical approaches such as multivariate logistic regression have proved to be limited by a lack of power, or would require huge populations.21 To circumvent these limitations, we used both the multifactor dimensionality reduction (MDR)21 and the restricted partition method (RPM).22 MDR is designed to detect high-order gene–gene or gene–environment interactions in relatively small samples. The method defines a new Boolean (high risk/low risk) variable summarizing the multilocus and environmental information. The new Boolean variable is then evaluated for its ability to classify and predict high- or low-risk status using cross-validation testing. MDR analysis has already shown its ability to detect susceptibility loci in various diseases.20, 23, 24, 25 The RPM method uses a partitioning algorithm to determine predictors of a quantitative trait. In this study of 846 subjects from a general population, we aimed to determine, among 19 SNPs from 10 genes and 3 phenotypic traits (gender, adiponectinemia and body mass index (BMI)), the relevant combinations of genetic and environmental variables involved in the modulation of plasma LDL-C level.

Materials and methods

Subjects

Participants were recruited in the framework of the World Health Organization-Multinational mONItoring of trends and determinants of CArdiovascular diseases (WHO-MONICA) population survey conducted from 1995 to 1997 in the urban community of Lille in the North of France. The sample included 1195 representative subjects (601 men/594 women) aged 35–64 years, stratified by town size and randomly selected from the electoral rolls to obtain 200 participants for each gender and 10-year age group (WHO-MONICA Project protocol). The Ethical Committee of Lille University Hospital (CHRU) approved the protocol. After signing an informed consent form, participants were administered a standard questionnaire including personal medical history. Physical measurements were taken by a specially trained nurse. A fasting blood sample was drawn for 1170 participants (590 men and 580 women). The studied sample consisted of a subgroup of 846 subjects (424 men and 422 women; mean age=49±8 years, range 35–66; mean BMI=25.7±4.4 kg m−2, range 16.1–44.1) who were not treated for hypercholesterolemia, hypertension or type 2 diabetes mellitus. Adiponectinemia was measured using a commercial assay kit (LINCO Research, St Charles, MO, USA) as previously reported.10 Plasma LDL-C concentrations were calculated by the Friedewald formula.26 The population was divided into four BMI classes: class 1: BMI<25 kg m−2; class 2: 25≤BMI<27 kg m−2; class 3: 27≤BMI<30 kg m−2; and class 4: BMI≥30 kg m−2.
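
As an illustration of the two derived quantities described above, the following minimal sketch computes LDL-C with the Friedewald formula and assigns a BMI class. It assumes lipid values expressed in mmol/L (triglycerides divided by 2.2; the mg/dL form of the formula divides by 5) and assumes that each class includes its lower bound. The function names are hypothetical and are not taken from the study.

```python
def friedewald_ldl_c(total_chol, hdl_c, triglycerides):
    """Estimate LDL-C in mmol/L as TC - HDL-C - TG/2.2.

    Assumption: all inputs are in mmol/L. The formula is customarily
    not applied when triglycerides are very high (about 4.5 mmol/L).
    """
    return total_chol - hdl_c - triglycerides / 2.2


def bmi_class(bmi):
    """Return the BMI class (1 to 4) for a BMI given in kg/m^2."""
    if bmi < 25:
        return 1
    if bmi < 27:
        return 2
    if bmi < 30:
        return 3
    return 4
```

For example, a subject with total cholesterol 5.5, HDL-C 1.3 and triglycerides 1.1 mmol/L would have an estimated LDL-C of about 3.7 mmol/L.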

Genotyping

We genotyped SNPs with LightCycler LightTyper (Roche Diagnostics, Basel, Switzerland), with TaqMan (Applied Biosystems, Foster City, CA, USA) or by direct sequencing. To check for SNP genotyping errors, we systematically regenotyped 10% of DNA samples. We found concordance rates of 100% for all SNPs. All genotype distributions were consistent with Hardy–Weinberg equilibrium.
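
The text does not state which test was used to check Hardy–Weinberg equilibrium. As one common possibility, the sketch below computes the chi-square goodness-of-fit statistic for a biallelic SNP from its three genotype counts. The function name and the example counts are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for Hardy-Weinberg fit.

    n_aa, n_ab, n_bb are the observed counts of the three genotypes.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1 - p                            # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Cells with a zero expected count are skipped to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical counts that fit the equilibrium proportions exactly.
statistic = hwe_chi_square(25, 50, 25)
```

A statistic below 3.84 would be compatible with equilibrium at the 5% level.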

Haplotyping

Linkage disequilibrium (LD) and haplotype blocks were determined with the Haploview software.27 Haplotypes were constructed with the Phase 2.1 software.28, 29 Phase 2.1 implements a Bayesian statistical method that infers phase and constructs haplotypes from population genotype data using a Markov chain Monte Carlo algorithm and coalescent theory. It was shown to infer haplotypes more accurately than other Bayesian-based methods in real data sets.29
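
For readers unfamiliar with the LD measure D′, the sketch below computes it for two biallelic loci. It assumes that the haplotype frequency is already known, whereas in practice it has to be estimated from unphased genotypes. The function name and the example frequencies are hypothetical.

```python
def d_prime(p_ab, p_a, p_b):
    """Normalized linkage disequilibrium D' between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0


# Two rare alleles that never occur on the same haplotype give D' of 1.
example = d_prime(p_ab=0.0, p_a=0.1, p_b=0.2)
```

The example shows why a pair of SNPs whose rare alleles are never found together yields a D′ close to 1.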

MDR analysis

For evaluation of high-order interactions among genetic and environmental variables with a relatively small sample size and a large number of variables, we used the MDR method.21 It includes a combined cross-validation procedure that divides the data into a training set and a testing set and thus minimizes false-positive results by multiple examinations of the data. With 10-fold cross-validation, the data are divided into 10 equal parts, and the model is developed on 9/10 of the data (training set). A set of n candidate variables is selected, representing the data in an n-dimensional space. The ratio of the number of cases to the number of control subjects is evaluated within each multifactor cell in the n-dimensional space, and cells are labeled as high or low risk according to this ratio. This reduces the n-dimensional model to one dimension (that is, one variable with two multifactor classes: low risk and high risk). All possible combinations of n factors are evaluated sequentially for their ability to classify affected and unaffected subjects in the training set, and the best n-variable model is then tested on the remaining 1/10 of the data (testing set). These steps are repeated 10 times with the data split into 10 different training and testing sets. In addition to the prediction error, it is of interest to use cross-validation consistency as a measure of evidence for a particular model, that is, how often the same variables were selected across the 10 cross-validations. We developed Perl scripts to perform repeated MDR analyses with numerous random partitions of the data between training and testing sets to better evaluate the reproducibility of the results.
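
The procedure described above can be sketched as follows. This is a simplified illustration in Python, not the MDR software or the Perl scripts used in the study. It assumes discrete factor levels, case status coded 0/1, a risk threshold equal to the case/control ratio of the training set, plain (unbalanced) accuracy, and low-risk status for cells never seen in training. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def fit_cells(rows, labels, combo):
    """Label each multifactor cell high risk (True) or low risk (False)."""
    counts = defaultdict(lambda: [0, 0])          # cell -> [controls, cases]
    for row, y in zip(rows, labels):
        counts[tuple(row[i] for i in combo)][y] += 1
    n_cases = sum(labels)
    threshold = n_cases / max(len(labels) - n_cases, 1)
    return {cell: cases >= threshold * controls
            for cell, (controls, cases) in counts.items()}


def accuracy(rows, labels, combo, cells):
    """Fraction of subjects whose cell label matches their status."""
    hits = 0
    for row, y in zip(rows, labels):
        predicted = cells.get(tuple(row[i] for i in combo), False)
        hits += int(predicted == bool(y))
    return hits / len(labels)


def mdr(rows, labels, n_factors, n_folds=10, seed=0):
    """Return (best combination, CV consistency, mean testing accuracy)."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    chosen, test_acc = Counter(), []
    for k in range(n_folds):
        held_out = set(folds[k])
        train = [i for i in order if i not in held_out]
        tr_rows = [rows[i] for i in train]
        tr_y = [labels[i] for i in train]
        te_rows = [rows[i] for i in folds[k]]
        te_y = [labels[i] for i in folds[k]]
        # Best combination = highest classification accuracy in training.
        best = max(combinations(range(len(rows[0])), n_factors),
                   key=lambda c: accuracy(tr_rows, tr_y, c,
                                          fit_cells(tr_rows, tr_y, c)))
        chosen[best] += 1
        test_acc.append(accuracy(te_rows, te_y, best,
                                 fit_cells(tr_rows, tr_y, best)))
    combo, consistency = chosen.most_common(1)[0]
    return combo, consistency, sum(test_acc) / n_folds
```

Calling `mdr(rows, labels, n_factors=2)` on a list of tuples of factor levels returns the column indices of the two-factor model selected most often, the number of folds in which it was selected, and the mean accuracy on the held-out folds.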

RPM analysis

In contrast to MDR, RPM is an iterative algorithm devoted to the study of a quantitative trait. As reported by Culverhouse et al.,30 RPM searches among multiple genetic and environmental factors for the combination that explains the largest part of the variance of a quantitative trait. It attempts to find the most reasonable partition for evaluation, balancing maximization of the between-group variation with minimization of the number of groups and the within-group variation. In RPM analyses, an interaction between variables may be inferred when the R-square of a combined model is at least 15% higher than the sum of the R-squares of the variables taken individually (R Culverhouse, personal communication).
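
The sketch below illustrates only the two quantities mentioned above: the R-square of a given partition (between-group sum of squares divided by total sum of squares) and the 15% rule of thumb for interaction. It does not implement the RPM search for the partition itself, and the function names are hypothetical.

```python
from collections import defaultdict


def partition_r_square(values, groups):
    """Proportion of the variance of `values` explained by `groups`."""
    total_mean = sum(values) / len(values)
    ss_total = sum((v - total_mean) ** 2 for v in values)
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    ss_between = sum(len(vs) * (sum(vs) / len(vs) - total_mean) ** 2
                     for vs in by_group.values())
    return ss_between / ss_total if ss_total > 0 else 0.0


def suggests_interaction(r2_combined, r2_singles, margin=0.15):
    """True if the combined R-square exceeds the sum of single R-squares
    by at least `margin` (15% by default)."""
    return r2_combined >= (1 + margin) * sum(r2_singles)
```

A partition that separates the trait values perfectly gives an R-square of 1, and one whose groups all share the same mean gives 0.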

Results

Summarization of genetic variables using haplotype structures

In the general population studied here, SNPs of the ADIPOQ gene showed strong LD defining two haplotype blocks, −11391G>A/−11377C>G (D′=0.997) and +45T>G/+276G>T (D′=0.990), as previously reported in other populations.10, 31, 32, 33 For each block, the haplotypes including both rare variant alleles were never observed (theoretical frequencies of the −11391A_−11377G and +45G_+276T haplotypes were 0.000029 and 0.000072, respectively). To include haplotypic information in further analyses, haplotypes at the ADIPOQ locus were inferred for each subject using the Phase 2.1 software. Only diplotype configurations inferred with a posterior probability >0.95 were retained, defining six diplotype configurations for each haplotype block and thus two variables. Previous studies of the ADIPOQ gene reported two LD blocks (in the promoter and in the exon 2–intron 2 region) showing independent associations with adjusted adiponectin level;31, 32, 34 in agreement with these studies, adiponectin level in our general population sample was modulated by ADIPOQ haplotypes from the above-mentioned LD blocks (P=0.0013 and 0.0058, respectively). Among the other genes included in our study, only ENPP1 was genotyped for several SNPs. The five ENPP1 SNPs defined two haplotype blocks, K121Q/IVS8+27T>G (D′=0.999) and +828 13 bp insertion/+1044A>G (D′=0.999), and one ‘independent’ SNP, IVS20-11delT, not included in either haplotype block. Haplotypes were inferred for each block using the Phase 2.1 software. Diplotype configurations inferred with a posterior probability >0.95 were retained, defining six diplotype configurations for each haplotype block. Thus, three variables were used for the ENPP1 gene in the analyses: SNP IVS20-11delT and the diplotypes of the K121Q/IVS8+27T>G and +828 13 bp insertion/+1044A>G blocks. Regarding the SCARB1 gene, LD between the three SNPs was considered weak (D′ between 0.2 and 0.7); thus, they were included as three independent variables in the analyses.

MDR analysis of the LDL-C level

We divided our population into two clusters using a combination of high-density lipoprotein cholesterol (HDL-C) and LDL-C as a marker of dyslipidemia. High- and low-risk clusters for dyslipidemia were defined separately for men and for women according to the median for each subgroup. Subjects were considered at risk if they simultaneously had HDL-C below the median of the subgroup and LDL-C above the median of the subgroup. In a less stringent way, we classified the subjects into two clusters according to the median of LDL-C alone in our population. Both classifications were in almost total concordance (κ=0.999); thus, we opted to use the LDL-C criterion alone as a marker of dyslipidemia. As MDR requires a binary trait, and to obtain a greater contrast than with two groups split at the median, analyses were performed between the 1st and 3rd tertiles of LDL-C level.
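
A minimal sketch of such a dichotomization is given below. It assumes tertiles computed on the pooled sample (the text does not specify whether they were gender-specific) and breaks ties by sort order. The function name and example values are hypothetical.

```python
def extreme_tertiles(ldl_values):
    """Map subject index -> 0 (lowest tertile) or 1 (highest tertile).

    Subjects in the middle tertile are left out of the returned mapping.
    """
    ranked = sorted(range(len(ldl_values)), key=lambda i: ldl_values[i])
    third = len(ranked) // 3
    status = {i: 0 for i in ranked[:third]}
    status.update({i: 1 for i in ranked[len(ranked) - third:]})
    return status


# Six hypothetical LDL-C values: two subjects fall in each tertile.
groups = extreme_tertiles([3.0, 5.0, 4.0, 2.0, 6.0, 4.5])
```

Dropping the middle tertile reduces the sample used for MDR by about one-third in exchange for a sharper contrast between the two groups.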

Table 1 summarizes the results of an exhaustive MDR analysis that evaluated all possible one-, two- and three-variable models from gender, BMI, adiponectin level and the 15 genetic variables summarizing 19 SNPs from 10 genes. Although the model including BMI as the only variable showed a cross-validation consistency of 10/10, the overall best model included BMI and diplotypes of ADIPOQ +45T>G/+276G>T, with a cross-validation consistency of 10/10 and a slightly better testing accuracy than BMI alone (0.590 vs 0.575). This two-variable model was significant at the 0.0004 level. The odds ratio (OR) was 3.13 (95% confidence interval (CI), 2.20–4.46). One thousand repeated MDR analyses with random partitions of the data between training and testing sets showed BMI as the best one-variable model, and BMI and diplotypes of ADIPOQ +45T>G/+276G>T as the best two-variable model, with 100% reproducibility. Moreover, the interaction dendrogram of the MDR software showed an interaction between BMI and diplotypes of ADIPOQ +45T>G/+276G>T (data not shown).

Table 1 Results of MDR analyses with 3 phenotypic and 15 genetic variables summarizing 19 SNPs from 10 genes

RPM analysis of the LDL-C level

To analyze the determinants of LDL-C with another method, we opted for the RPM algorithm, which can handle a quantitative trait. We first explored the models including one explanatory variable. Among the 18 variables tested, only 5 significantly partitioned our population into groups of different mean LDL-C (Table 2). As expected, BMI was the best explanatory variable, as the proportion of variation attributable to the partition (R-square) was the highest observed. We further explored the two-variable models. Among the 153 possible models including two variables, the combination of BMI and diplotypes of ADIPOQ +45T>G/+276G>T showed the best R-square and was therefore assumed to be the best two-variable model (Table 3). This model partitioned our population between a low (mean=3.57, s.d.=0.95) and a high LDL-C (mean=4.12, s.d.=1.02) group. The R-square of each variable alone was 0.0365 for BMI (Table 2) and was not different from zero for the diplotypes of ADIPOQ +45T>G/+276G>T (data not shown). Thus, the R-square of the combined two-variable model (R-square=0.0601) was 64% higher than the sum of the R-squares (0.0365+0.000) of the variables taken individually, in agreement with an interaction between BMI and diplotypes of ADIPOQ +45T>G/+276G>T. A three-variable model implies 816 combinations of variables for the 18 variables included. As our population consists of 846 subjects, there is an obvious risk of overfitting the data, and the results of the three-variable models should not be taken into account.
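
The counts of candidate models and the relative gain in R-square quoted above can be checked with a few lines of arithmetic. The variable names are hypothetical; the numbers are those given in the text.

```python
from math import comb

n_variables = 18                               # 3 phenotypic + 15 genetic
two_variable_models = comb(n_variables, 2)     # unordered pairs
three_variable_models = comb(n_variables, 3)   # unordered triplets

r2_bmi, r2_adipoq, r2_combined = 0.0365, 0.0, 0.0601
r2_sum = r2_bmi + r2_adipoq
relative_gain = (r2_combined - r2_sum) / r2_sum

print(two_variable_models, three_variable_models, round(relative_gain, 3))
# prints: 153 816 0.647
```

The relative gain of roughly 64 to 65% is well above the 15% threshold used as evidence of interaction in RPM analyses.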

Table 2 Results of univariate RPM analyses with 3 phenotypic and 15 genetic variables summarizing 19 SNPs from 10 genes
Table 3 Results of bivariate RPM analyses with 3 phenotypic and 15 genetic variables summarizing 19 SNPs from 10 genes

We compared the classifications obtained using both methods. The results are summarized in Table 4. There was a good concordance between both methods (κ=0.83; 95% CI, 0.61–1.00). Moreover, knowing the best combination of variables and their interaction, we included BMI, diplotypes of ADIPOQ +45T>G/+276G>T and their interaction term in a general linear model and observed a statistically significant model (P<0.0001) and a significant interaction (P=0.04). There was no significant difference between adiponectin levels adjusted by gender and BMI between the clusters of high and low LDL-C defined by MDR or RPM (P>0.05).
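
For reference, Cohen's κ for the agreement between two classifications can be computed from their cross-tabulation as sketched below. The counts in the example are hypothetical and are not those of Table 4. The function name is also hypothetical.

```python
def cohen_kappa(table):
    """Cohen's kappa from a square agreement table.

    table[i][j] = number of subjects placed in class i by the first
    method and in class j by the second method.
    """
    size = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(size)) / n
    expected = sum(sum(table[i]) * sum(row[i] for row in table)
                   for i in range(size)) / n ** 2
    # Undefined (division by zero) when chance agreement is already total.
    return (observed - expected) / (1 - expected)


# Hypothetical 2x2 table (low/high LDL-C by two methods).
kappa = cohen_kappa([[40, 10], [5, 45]])
```

With these hypothetical counts, observed agreement is 0.85, chance agreement is 0.50 and κ is 0.70.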

Table 4 Comparison of the classifications obtained with the MDR and the RPM methods as low and high LDL-C with the two variable models

Discussion

Our data, obtained with two data mining methods, show that among the variables included, BMI and a particular genetic status at the ADIPOQ locus, which encodes adiponectin, appear as the best variables explaining LDL-C level. LDL-C was chosen as it is, among others, a well-established marker of atherogenesis. Moreover, in most epidemio-genetic studies related to cardiovascular disorders or to dyslipidemia, the LDL-C phenotype is systematically investigated in view of its potential association with genetic determinants. Therefore, in this study, besides gender, BMI and adiponectinemia, we focused our attention on 19 SNPs from the PPARG, UCP3, ADIPOQ, TNF, LIPC, CARTPT, PCSK9, SCAP, SCARB1 and ENPP1 genes, which have all been reported to be associated with IR and/or plasma lipid levels. For the ADIPOQ and ENPP1 genes, where several SNPs were genotyped, we opted to use haplotypic structures as they better reflect the genetic architecture of the genes.18, 32 In haplotype analyses, the population is partitioned into a larger number of strata than in SNP analyses, which could lower the power. Nevertheless, when haplotypes better capture the genetic variation at a given locus, they are more efficient in analyses than SNPs alone. It is widely accepted, and has been previously reported, that the haplotypes at the ADIPOQ and ENPP1 loci used in our study capture the genetic information better than SNPs alone.9, 18 The LDL-C phenotype is a complex trait resulting from potentially multiple gene–gene and gene–environment interactions. Most previously reported association studies investigated one SNP or one gene at a time. Conventional multigenic and multifactorial approaches, such as multivariate logistic regression including many explanatory variables, have proved to be limited by a lack of statistical power,21 or would require huge populations and only investigate a limited number of interactions.

The MDR and RPM methods make it possible to determine the best combination of variables and interactions that explain either a binary trait (MDR) or a quantitative trait (RPM). The MDR method is nonparametric and does not presuppose any mathematical relation between the variables (such as the linear relation assumed in logistic regression). The RPM method is a robust method to examine quantitative phenotypes even if the loci have no single-locus effect. The two independent methods gave quite similar results, both for the explanatory variables and for the classification into low and high LDL-C groups. The best models with the MDR and RPM classifications both included BMI, the genetic status at the ADIPOQ locus and their interaction. This model was confirmed a posteriori using a conventional statistical method. Regarding MDR analyses, the improvement in testing accuracy when adding ADIPOQ diplotypes to the model was weak; this is not surprising, as LDL-C is a multifactorial trait modulated, among others, by many genetic factors, each exhibiting a weak effect. Nevertheless, this weak effect was confirmed by RPM and by conventional statistical methods.

A slight discrepancy between the MDR and RPM methods occurred for subjects with a BMI >30 kg m−2 (class 4) and a T_G/T_G ADIPOQ +45 +276 diplotype. It is noteworthy that this discrepancy relies on only 3 out of 47 subjects being classified differently by MDR. Likewise, a discrepancy involving 3 subjects out of 25 occurred for subjects with a BMI under 25 kg m−2 (class 1) and a T_T/G_G ADIPOQ +45 +276 diplotype. Regarding the variance explained, as expected in one-variable models, BMI was the best predictor, explaining twice as much variance as the best genetic variable (UCP3). It is noteworthy that ADIPOQ +45 +276 diplotypes alone had no detectable effect on LDL-C. This was confirmed by classical ANOVA (data not shown). Including this genetic variable (diplotypes of ADIPOQ) and BMI in the same model raises the variance explained by 64%. This increase relies only on the interaction between the two variables. Disclosing similar findings with classical statistical methods would have required 306 analyses: 153 including two explanatory variables and 153 including the interaction term as an additional variable. After correction for multiple testing, it is unlikely that a significant result would have been found. In this context, the RPM and MDR methods appear well suited to data mining aimed at dissecting complex diseases in which multiple genetic determinants, environmental factors and their interactions may be involved.

Together with gender, BMI and genetic variables, adiponectin level was included in the analyses. Interestingly, although genotypes at the ADIPOQ locus were among the factors associated with LDL-C level, adiponectin level itself was not. This is not unexpected, as only 3% of the variance of serum adiponectin level is explained by haplotypes at the ADIPOQ locus.9 As adiponectin levels adjusted for gender and BMI were similar in the clusters of high and low LDL-C defined by MDR or RPM, ADIPOQ genotypes classified as at risk for LDL-C level do not reflect a significant variation of adjusted adiponectin level, and we can exclude the possibility that genotypes at the ADIPOQ locus influence LDL-C level through a modulation of adiponectin level. This is in agreement with previous results showing that, although genetic variants of the ADIPOQ gene modulate adiponectin level, they do not contribute to the genetic linkage with the metabolic syndrome at the 3q27 locus.9, 35 In addition, as adiponectin level is not a determinant of LDL-C level, it seems logical that the PPARG gene was not among the factors that modulated LDL-C level, although the adiponectin gene contains a PPARgamma-responsive element. Genetic variations in the remaining genes included in this study (UCP3, TNF, LIPC, CARTPT, PCSK9, SCAP, SCARB1 and ENPP1), although individually associated with IR, metabolic syndrome and/or plasma LDL-C level, did not discriminate LDL-C level in our multifactorial analyses.

As genome-wide scans and one meta-analysis reported highly significant linkage with coronary heart disease and LDL-C level at the 3q27 locus,36, 37, 38, 39 we hypothesize that haplotype blocks of the ADIPOQ gene capture genetic variation(s) from neighboring gene(s) that modulate metabolic syndrome, coronary heart disease and LDL-C level. More than 20 genes map to the 3q27 locus. Although most do not appear to be putative candidates (FETUB, DNAJB11, CRYGS, RTP4, DGKG…), several have already been reported to be associated with phenotypes of the metabolic syndrome: the alpha 2 Heremans–Schmid glycoprotein (AHSG) gene is associated with type 2 diabetes,40 kininogen, encoded by the KNG1 gene, is involved in insulin sensitivity, at least in rodents,41 and the eukaryotic translation initiation factor 4 alpha 2 (EIF4A2) gene contributes to the linkage with type 2 diabetes.42 Additional investigations including at least genetic variants from these genes will be required to better define the 3q27 contribution to metabolic syndrome, coronary heart disease and LDL-C level. Moreover, a similar analysis in an independent population sample would be very instructive. Overall, our data show that data mining methods such as MDR and RPM are well suited to dissecting multifactorial traits in relatively small samples and to detecting the most prominent determinants among numerous genetic and environmental variables and their complex interactions.