Type 2 diabetes mellitus (T2DM) is a chronic metabolic disease with a complex pathogenesis defined by genetic predisposition and environmental factors1. In Lebanon, T2DM was evaluated in 8,050 Lebanese cases in 1990. Its prevalence and incidence, similar to the international averages, were 5.0% and 1.5 to 1.7% respectively2. In 1992, a study of 436 cases from across the Lebanese territory gave a prevalence of 7 to 8% for T2DM and 10 to 11% for impaired glucose tolerance3. In 2005, a study of 3,000 exclusively Lebanese individuals from Greater Beirut showed a prevalence of 11.3%, which increased with age4. The combined prevalence of previously and newly diagnosed T2DM was 15.8%. At that time, in the U.S., 6.3% of the population had T2DM: 4.5% diagnosed and 1.8% undiagnosed according to the 2004 National Diabetes Fact Sheet. These results suggest that the prevalence of T2DM in Greater Beirut is relatively high and is increasing among the Lebanese population.

Furthermore, according to the latest figures in the American National Diabetes Statistics Report, 29.1 million children and adults in the United States have diabetes and 86 million people have prediabetes5. The pathogenesis of T2DM is closely associated with a positive family history, male gender, age over 45 years, overweight, hypertension and abnormal lipid levels. In addition, the genetic contribution to T2DM is well recognized with a total of 91 established associated susceptibility loci6,7,8,9,10,11,12,13,14,15,16,17. The common variants in the reported loci however account for only a small proportion of the heritability of T2DM and the functional role of most of these variants remains far from clear. It is possible that a large number of highly-penetrant but rare T2DM susceptibility genetic variants remain to be identified. Additional genome-wide explorations including whole genome and exome sequencing in well-established groups of patients and controls may unravel these additional important genetic disease contributors which will undoubtedly help us understand the complex mechanisms involved in the development of T2DM.

The prevalence of T2DM is distinctly variable across populations and this variability adds to the disease complexity. This variability could be due to differences in lifestyle factors such as dietary habits, as well as behavioral patterns among populations. Multiple established T2DM susceptibility genetic loci have been identified from previous Genome Wide Association Studies (GWAS) in populations of European and Asian ethnicities17,18,19. It is, however, equally important to replicate these previously discovered associations in ethnically different populations and to identify novel predisposing genetic factors that may be more strongly associated there than in previously studied populations.

To identify T2DM susceptibility loci in the Lebanese population, we performed a GWAS of more than 5,000,000 SNPs, which were directly genotyped or imputed using the 1000 Genomes Project reference panels. Our study enrolled 1,388 patients and 1,902 non-diabetic subjects with detailed cross-sectional demographic and clinical information. Previously, targeted T2DM susceptibility SNP replication studies have been conducted in the Lebanese population. However, this study seeks to establish a baseline association for T2DM in the Lebanese population and thus it is the first genome-wide scan to find genomic regions implicated in T2DM. We also make use of recent reference data from the 1000 Genomes Project, which was previously shown to identify novel and refined associations for type 2 diabetes mellitus20,21.


The study population has a mean age of 62.9 (±11.1) (Table 1). The healthy control group has a mean age of 62.40 (±11.82), compared to 63.62 (±9.98) for the T2DM patients. Sixty-three percent of the individuals were males with 40.05 percent suffering from T2DM. The mean BMI was 28.1 (±4.6) according to standard measures. Fifty-two samples that failed the genotype call rate threshold (<95%)27 or that showed relatedness (kinship coefficient > 0.0442)28 were removed.

Table 1 Distribution in the surveyed population of age, gender, BMI and coronary artery disease by T2DM diagnostic status


A total of 37,995,922 sites were imputed. The final set included 5,891,794 SNPs kept after QC (MAF > 0.05, AIMs removal). The accuracy of genotype imputation depends on the array SNP coverage and the similarity of haplotypes between the study dataset and the reference panels. The Lebanese population was previously shown to be genetically very close to Europeans29, who are well represented in the 1000 Genomes Project references. This resulted in high imputation accuracy as shown by the IMPUTE2 certainty metric (Figure 1). Figure 1 shows that imputation confidence depends on the chip, with the HumanOmniExpress (~700,000 SNPs) having better imputation accuracy than the Quad BeadChips (~550,000 SNPs). However, the leading principal components reveal no bias regarding population coverage or stratification between chips (Figure S1).

Figure 1

Imputation accuracy.

Per-sample imputation confidence scores between true genotypes and genotypes predicted by imputation, averaged over imputation chunks. Accuracy increases with increased array SNP coverage.

Association with T2DM

To map genetic loci associated with T2DM, genotyped and imputed data in 3,286 Lebanese individuals were tested for association with the disease (Figure 2). We identify variations located on chromosomes 4, 6, 9, 10, 12 and 18 that show trends of association with T2DM reaching P < 10^−5 under an additive model (Supplementary Table 1). Seven variants located in CDKAL1 (cyclin-dependent kinase 5 regulatory subunit associated protein 1-like 1) reached genome-wide significance (P < 5 × 10^−8) with lead signal rs7766070 (OR = 1.38, P = 9.37 × 10^−9) (Figure 3, Table 2).

Table 2 T2DM GWA results showing SNPs that reached genome-wide significance (<5 × 10^−8) in the initial testing phase or after conditioning for sex, age and BMI

Figure 2

Manhattan plot.

Plot shows results of T2DM GWA analysis assuming additive impact in 3286 Lebanese using >5,000,000 genotyped or imputed SNPs. The Y-axis corresponds to the significance of the association (−log10 p-values). The X-axis represents the physical location of the variant colored by chromosome.

Figure 3

Regional association plots for the TCF7L2 (A) and CDKAL1 (B) genes.

Each regional plot shows the chromosomal position (hg 19) of SNPs in the specific region against −log10 p values estimated assuming additive impact. The SNP with the highest association signal at each locus is shown as a purple star; the other SNPs are colored according to the extent of LD with that SNP. Estimated recombination rates from the 1000 Genomes Project European population (release March 2012) are shown as light blue lines.

Adjustment for sex, age and BMI (Table 2) identified 16 additional genome-wide significant variants in CDKAL1 and TCF7L2 (Transcription factor 7-like 2) and showed that the association of these two loci with T2DM appears to be independent of age and BMI. These variants are also added to Table S1 along with imputation scores.

The genomic control inflation factor (λ), which compares observed association statistics against the expected distribution, was 1.073, suggesting possible but marginal over-dispersion of the association statistics (Figure 4)30.
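As a rough illustration of how such an inflation factor is commonly obtained, the sketch below divides the median observed 1-df chi-square statistic by the median expected under the null. It assumes two-sided p-values from 1-df tests; the function names are hypothetical and this is not necessarily the exact procedure used in the study.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom,
# obtained from the standard normal quantile: (z_0.75)^2, about 0.455.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile; its sign is irrelevant
    return z * z


def genomic_inflation(p_values):
    """Median observed chi-square divided by the median expected under the null."""
    return median(p_to_chi2(p) for p in p_values) / CHI2_1DF_MEDIAN
```

Applied to uniformly distributed p-values this returns a value close to 1; values above 1 indicate test statistics that are larger than expected under the null.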

Figure 4

Quantile-Quantile plot of the GWAS results.

Plot compares observed −log10 p values of the tested SNPs on the vertical axis to expected −log10 p values under the null hypothesis on the horizontal axis. The genomic control ratio was 1.073, indicating a lack of strong effect from systematic error such as population stratification.
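The coordinates of such a plot can be sketched as follows. The plotting position (rank − 0.5)/n used for the expected quantiles is an assumption (one common convention), and the function name is hypothetical.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10 p pairs, most significant first.

    Under the null hypothesis p-values are uniform on (0, 1), so the
    i-th smallest of n p-values is expected to lie near (i - 0.5) / n.
    """
    observed = sorted(p_values)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = (rank - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points
```

Points lying on the diagonal indicate agreement with the null distribution; an early, uniform departure from the diagonal would suggest systematic inflation rather than true association.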

SNPs of interest that showed a trend of association but did not reach genome-wide significance were 12 variants on chromosome 9 (Supplementary Table 1), overlapping a region with microRNA 3910-1 and 3910-2 downstream of the gene ROR2 (receptor tyrosine kinase-like orphan receptor 2).

Three of the SNPs identified in Table 2, rs7766070, rs9368222 and rs34872471, were significant in the DIAGRAM consortium6,7,9,10,11,15,17. One other, rs1044083315, was identified but was not genome-wide significant (Table S2); it may nevertheless be of particular interest in Middle Eastern populations31. Table S2 lists DIAGRAM consortium study SNPs along with comparisons with our results and imputation confidence scores.


The global prevalence of T2DM has been steadily increasing over the past 50 years and is today considered a major international health concern. In addition to environmental factors such as diet and lifestyle, genetic susceptibility appears to have a substantial role in the etiology of T2DM. Furthermore, T2DM risk alleles show significant disparity in frequencies across worldwide populations. Studies analyzing distribution patterns of genetic diseases found that T2DM risk alleles demonstrated the most extreme differentiation32, with population frequencies decreasing from Sub-Saharan Africa through Europe to East Asia32,33.

The current work contributes to the ongoing worldwide effort to identify the genetic factors that affect the risk of T2DM. We provide analysis in a novel population that is highly admixed with ancestral components from the Middle East, Europe, Central Asia and sub-Saharan Africa29. The risk alleles identified in the Lebanese population could assist in understanding the disparity in T2DM prevalence across worldwide populations.

We report the first genome-wide scan for genetic susceptibility to T2DM in Lebanese subjects. We demonstrate strong association of T2DM with the CDKAL1 and TCF7L2 genes, previously shown to predispose to this condition15,18,19,24,26,34,35,36,37. A large number of genetic variants have recently been identified as associated with T2DM in GWASs in Europeans6,10,15,17,19,25. Elucidating the genetic mechanisms responsible for the heterogeneous condition of T2DM in different ethnicities may contribute substantially to a better understanding of the complexity of this disease. While some associations are consistent across different ethnic groups38,39, some, like the TCF7L2 variants, have been found to be heterogeneous6,25. Our study confirmed genome-wide association of 23 different variants across the CDKAL1 and TCF7L2 genes previously reported to play a role in T2DM in different populations23,40,41. This association remained significant after adjustment for sex, age and BMI, suggesting that these risk factors are not likely to be major contributors to the observed association.

After adjusting for age and BMI, most variants in CDKAL1 and TCF7L2 showed stronger association with T2DM, indicating that these genes confer susceptibility to T2DM independently of the classical clinical risks. Functional studies of these two genes show their implication in down-regulating blood glucose levels. CDKAL1 encodes a 65 kD protein of unknown function expressed in pancreatic islet and skeletal muscle26. CDKAL1 shows homology with CDK5 regulatory subunit associated protein 1 (CDK5RAP1), an inhibitor of CDK5 activation. CDK5 plays a role in the regulation of pancreatic beta cell function by down-regulating insulin expression42,43. The TCF7L2 gene product is a high mobility group (HMG) box-containing transcription factor implicated in blood glucose homeostasis. TCF7L2 regulates proglucagon through repression of the proglucagon gene in enteroendocrine cells via the Wnt signaling pathway44.

T2DM is a socio-economic burden in Lebanon and genetic studies aim at posing important public health questions in regard to strategies for diagnosis, treatment and prevention of the disease. Previous replication studies examining a total of 29 SNPs in the Lebanese population have identified variants associated with T2DM in COL8A1, KCNQ1, ALX4, HNF145 (n = 2071), CDKAL150 (n = 1422) and TCF7L246,47 (n = 1610).

Our genome-wide analysis of more than 5,000,000 SNPs, in the largest T2DM study to date in the region, identified only variations in the CDKAL1 and TCF7L2 genes as associated with the disease at genome-wide significance.

Lebanon represents a Levantine region with higher European affinity than other parts of the Middle East48 and therefore potentially fills in the map between SNPs identified in European populations and those identified in other Middle Eastern populations. Further, since T2DM is an emerging issue within the Middle East, these populations may highlight some variants distinct from those in other regions. Even accepting the relatively low power compared to that available in meta-studies, the number of replicated genome-wide significant SNPs in our study is smaller than expected, especially given the affinity of Lebanon with European populations48.

Our results contribute to ongoing efforts to identify and confirm risk alleles associated with T2DM. Mapping disease alleles in different populations can provide important perspectives in disease diagnosis and steer functional studies to provide better understanding of the pathogenesis of T2DM.


Study Participants

The study participants were all of Lebanese origin and were recruited in three phases. A first recruitment campaign, conducted in collaboration with the Lebanese University Medical Center, led to the recruitment of 506 subjects from the suburbs of Beirut, the capital of Lebanon. In a second campaign conducted in North Lebanon, 492 subjects were successfully recruited. The 2,292 remaining participants were recruited as part of the Functional Genomic Diagnostic Tools for Coronary Artery Disease project initiative (FGENTCARD) from the Rafic Hariri University Hospital and the “Centre Hospitalier du Nord” in Lebanon49,50. Research and methods were carried out in compliance with the Helsinki Declaration and with the approval of the LAU institutional review board and local ethics committees on human research (Reference number SMPZ08072010-4). All participants signed an informed consent and data and blood samples were obtained from each individual. Participants in the first two recruitment campaigns (1) answered a detailed questionnaire, (2) gave a blood sample for DNA analysis and (3) gave a blood sample for HbA1C, fasting blood glucose (FBS) and lipid profile measures after 12 hours of fasting. For the remaining 2,292 FGENTCARD participants, a detailed questionnaire was completed, a blood sample was obtained for DNA and metabolite analysis, and annotations were coded from medical charts for data such as laboratory tests, prescribed medications and presence of clinical conditions. Table 1 describes clinical and demographic characteristics of the study participants49. A previously reported51 analysis of these subjects showed association with recruitment for questionnaire variables such as “exercise level” and “smoking usage”. However, blood panel associations showed very consistent odds ratios for T2DM and coronary artery disease risk independently of recruitment.

Selection of Patients and Controls

For the 998 participants selected through the two recruitment campaigns, an HbA1C of 48 mmol/mol (6.5%) was used as the cut-off point for diagnosing T2DM, in line with the World Health Organization definition of type 2 diabetes mellitus diagnosis52. For the 2,292 participants from the FGENTCARD project, T2DM diagnosis was ascertained by a physician, supported by the subjects' HbA1C levels and/or their two-hour plasma glucose concentration after an oral glucose tolerance test as documented in their medical records. For the control dataset, 85.23% of the participants were ≥50 years old and had no history of T2DM. Body mass index (BMI) was calculated according to standard measurements. Selection criteria for DNA analysis were based on the complete availability of the genotyping data and the T2DM-relevant characteristics, resulting in the selection of 3,286 subjects.
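The correspondence between the two HbA1C units quoted for the cut-off can be checked with a short sketch. The slope and intercept are those of the commonly cited IFCC-NGSP master equation and are an assumption here, as is treating the cut-off as inclusive (≥); the function names are hypothetical.

```python
# Assumed constants of the IFCC-NGSP master equation:
# NGSP (%) = 0.09148 * IFCC (mmol/mol) + 2.152
SLOPE = 0.09148
INTERCEPT = 2.152

# Cut-off stated in the text (48 mmol/mol, i.e. 6.5%).
T2DM_CUTOFF_MMOL_MOL = 48.0


def ifcc_to_ngsp(mmol_mol):
    """Convert HbA1C from mmol/mol (IFCC) to percent (NGSP)."""
    return SLOPE * mmol_mol + INTERCEPT


def ngsp_to_ifcc(percent):
    """Convert HbA1C from percent (NGSP) to mmol/mol (IFCC)."""
    return (percent - INTERCEPT) / SLOPE


def meets_hba1c_cutoff(mmol_mol):
    """True when the value is at or above the cut-off (inclusive, by assumption)."""
    return mmol_mol >= T2DM_CUTOFF_MMOL_MOL
```

Under these assumptions 48 mmol/mol converts to roughly 6.5%, matching the paired values given for the cut-off.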

Genotyping and imputation

DNA was extracted using a standard phenol-chloroform extraction procedure. Samples were genotyped using Illumina HumanOmniExpress-12V1-1 Multi-use (623 controls, 839 cases, ~700,000 SNPs) and Illumina Human610-660W Quad BeadChips (1279 controls, 545 cases, ~550,000 SNPs). Plink27 was used for data management and quality control, keeping 3,286 samples with call rate ≥95%, SNP call rate ≥98%, Hardy-Weinberg p > 10^−6 and MAF ≥ 1%. Genotype data were converted to NCBI genome build 37 using LiftOver. Chromosomes were phased with SHAPEIT53 using the 1000 Genomes Phase I haplotypes54 provided on the IMPUTE2 website (1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2), September 2013) based on sequence data for 1,092 TGP samples from release 20110521. IMPUTE255 was employed to impute genotypes in blocks of ~5 Mb using all populations in the 1000 Genomes Phase I data. Accuracy of imputation was measured using IMPUTE2's information metric “info”.
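The per-SNP quality-control thresholds listed above can be illustrated with a minimal sketch. It is not the PLINK implementation: the Hardy-Weinberg check here uses a simple 1-df chi-square approximation for brevity, and the function name, argument names and genotype encoding are hypothetical.

```python
import math


def snp_qc(genotypes, min_call_rate=0.98, min_maf=0.01, min_hwe_p=1e-6):
    """Summarise one SNP against the stated thresholds.

    genotypes: one entry per sample, holding the count of the B allele
    (0, 1 or 2), or None when the genotype is missing.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return {"call_rate": 0.0, "maf": 0.0, "hwe_p": 1.0, "passes": False}

    call_rate = n / len(genotypes)
    observed = [called.count(k) for k in (0, 1, 2)]  # AA, AB, BB counts
    freq_b = (observed[1] + 2 * observed[2]) / (2 * n)
    freq_a = 1.0 - freq_b
    maf = min(freq_a, freq_b)

    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = [n * freq_a ** 2, 2 * n * freq_a * freq_b, n * freq_b ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper-tail probability of a chi-square variable with 1 df.
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))

    passes = call_rate >= min_call_rate and maf >= min_maf and hwe_p > min_hwe_p
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "passes": passes}
```

For example, a SNP with genotype counts of 25, 50 and 25 passes all three filters, whereas a SNP called as heterozygous in every sample fails the Hardy-Weinberg filter.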

Statistical analysis

GWAS can be confounded by population stratification. The most common approach to correct for population stratification is to include the principal components from a PCA as covariates. However, there is no established method to determine how many components to include, and selection is usually subjective with inconclusive results. Correction for stratification becomes even more complicated in populations like the Lebanese, where genetic structure is recent and not driven by the usual factors of distance and geography29. Furthermore, previous studies have shown that historical migrations to Lebanon have potentially created disease-associated population structures that are similar to those in the source populations22,56. In this study, we use insights from our previous studies on population stratification in the Lebanese to avoid type I errors in the GWAS.

We apply a method that uses available 1000 Genomes Project reference panels to impute a resolved ancestry dataset29 as well as the current T2DM cohort to the same SNP coverage resolution. We next use two model-based clustering methods to unambiguously classify subjects into groups (Figure 5):

  • 1- mclust uses model-based hierarchical clustering and the Bayesian Information Criterion to classify samples into related groups.

  • 2- ADMIXTURE57 assigns individuals in a model-based manner into ancestral populations using maximum likelihood estimates.

Figure 5

Population structure in the Lebanese.

A) Principal component analysis shows an orthogonal population structure. Model-based clustering using mclust identifies three groups similar to previously reported stratification in the Lebanese. B) Estimation of individual ancestry using ADMIXTURE at K = 3 shows three groups with disparate proportions of ancestral allele frequencies.

The two methods identified similar population structures and assigned samples into three ancestral groups (Figure 5) similar to previous ancestry reports on the Lebanese29. We then extracted 192,310 ancestry informative markers (AIMs) between the groups and excluded them from the GWA analysis. We also tested our method by using the identified ancestry groups as covariates in the GWAS and obtained identical results.

Association tests were performed using SNPTEST v2.458, employing frequentist association tests and taking into account uncertainty in imputed genotypes (calling threshold 0.9). At each SNP, an additive, dominant, or recessive model was tested versus a model of no association. Association tests were also conditioned on age, sex and BMI to test for a genetic effect over and above that explained by these covariates. Chip type and enrollment were also both included in a separate analysis; results showed correlated but no genome-wide significant SNPs. Further, a PCA painted by chip type (Figure S1) shows no significant bias. LocusZoom59 was used to plot regional association results from the genome-wide association scans.
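The genetic models named above differ only in how a genotype is turned into a predictor. The sketch below is one illustrative reading of a 0.9 calling threshold followed by additive, dominant or recessive coding; it does not reproduce SNPTEST's internals, and the function names and the probability layout (P(AA), P(AB), P(BB)) are assumptions.

```python
def call_genotype(probs, threshold=0.9):
    """Return the B-allele count of the most likely genotype, or None.

    probs: imputed probabilities (P(AA), P(AB), P(BB)). The genotype is
    treated as missing when no probability reaches the threshold.
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


def code_genotype(b_count, model):
    """Encode a called genotype (0, 1, 2 or None) under a genetic model."""
    if b_count is None:
        return None
    if model == "additive":   # effect scales with the number of B alleles
        return b_count
    if model == "dominant":   # one copy of B is enough
        return 1 if b_count >= 1 else 0
    if model == "recessive":  # two copies of B are required
        return 1 if b_count == 2 else 0
    raise ValueError(f"unknown model: {model}")
```

A heterozygote called with probability 0.95 is therefore coded 1 under the additive and dominant models and 0 under the recessive model, while a genotype whose best probability is 0.5 is dropped from the test.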