Introduction

Sickle-cell disease (SCD) is the most common autosomal recessive blood disorder in the United States, affecting approximately 1 in 400 African Americans,1 and causes considerable morbidity and mortality.2 The clinical manifestations of SCD include marked phenotypic heterogeneity, with involvement of genetic as well as environmental factors.3 It is well established that increases in fetal hemoglobin (HbF) can decrease the severity of SCD, because of its ability to inhibit the polymerization of sickle hemoglobin (HbS). Early clinical observations demonstrated that blood from infants with SCD showed little sickling compared with older patients, and that SCD patients with hereditary persistence of HbF production had less severe complications of their disease.4, 5 More recently, Platt et al.2, 6 saw a benefit to incremental increases in baseline HbF in terms of painful crises and mortality in patients with SCD. The ameliorating effect of HbF on SCD and other β-hemoglobinopathies has generated intense interest in understanding the control of HbF expression in adults.

HbF predominates in the fetus, but postnatally it declines to extremely low levels and is restricted to a sub-population of erythrocytes termed F-cells.7 In normal individuals, residual HbF synthesis continues throughout adult life, and HbF and F-cells are closely correlated traits (r²>0.9).8, 9, 10 The heritability of F-cell levels was estimated to be 90% in a European population,11 indicating that the expression of the γ-globin gene in adults is under strong genetic control. Previous studies have shown that levels of HbF are influenced by several factors, including age,12, 13 sex13 and a sequence variant (C → T) at position −158 upstream of the γ-globin gene (11p16), commonly referred to as the XmnI-Gγ polymorphism.14, 15 Linkage and association studies for HbF levels mapped quantitative trait loci (QTLs) to chromosomes 6q23, 8q and Xp22.2.16, 17, 18 In the last few years, several genome-wide association studies (GWAS) in different ethnic groups have established three major QTLs (BCL11A at 2p15, the HBS1L-MYB intergenic region at 6q23 and XmnI-Gγ at 11p16), accounting for 20–50% of phenotypic variation in HbF and F-cell levels.19, 20, 21, 22 Recently, fine mapping of HbF-associated signals confirmed the association of these QTLs and ruled out the previously proposed XmnI-Gγ polymorphism as an independent causal variant for HbF regulation.23 Two of these three QTLs, MYB and BCL11A, are oncogenes, which emphasizes the importance of cell proliferation and differentiation in HbF/F-cell regulation.
Subsequently, BCL11A (a zinc-finger transcription factor) was shown to function as a regulator of HbF, and it was hypothesized that it might repress expression of the γ-globin gene directly by interacting with cis-regulatory elements within the β-globin cluster, or indirectly by modulating cellular pathways that affect HbF expression.24 Interestingly, all the reported significant single-nucleotide polymorphisms (SNPs) of BCL11A reside within a 14-kb region in intron 2 of the gene, and are in moderate-to-high linkage disequilibrium (LD) in the African ancestry in Southwest USA (ASW) samples from the HapMap project,25 suggesting that they all tag the same genetic signal in that region. To map additional QTLs and gain more insight into the genetic regulation of F-cells, we performed a GWAS on the proportion of F-cells in African ancestry individuals, namely, patients with SCD from the Silent Infarct Transfusion (SIT) Trial cohort.

Materials and methods

Study and population samples

The SIT Trial is an international, multi-center clinical study funded by the National Institute of Neurological Disorders and Stroke (http://sitstudy.wustl.edu/).26 The study protocol was approved by the Institutional Review Board at the Johns Hopkins University School of Medicine and conducted in accordance with institutional guidelines. Samples were taken pre-transfusion, and for each patient DNA was collected from Epstein–Barr virus transformed lymphoblasts using Puregene Genomic DNA Purification kits (Gentra Systems, Minneapolis, MN, USA). Demographic and phenotypic information was collected for each participant, and the inclusion criteria for recruitment were age (5–15 years) and hemoglobinopathy diagnosis (either Hb SS or Hb Sβ⁰-thalassemia). Details of the study design are given elsewhere.26

Phenotypic assessment of HbF

Peripheral blood drawn from SIT Trial study subjects was mixed 1:1 with Alsever's solution (Sigma, St Louis, MO, USA), stored at 4 °C, and analyzed within 1 week of being drawn. For each subject, F-cells were enumerated using R-Phycoerythrin-conjugated monoclonal antibody directed to HbF (Invitrogen, Camarillo, CA, USA) following the manufacturer's instructions. Negative controls were prepared using isotype-matched nonspecific Phycoerythrin-conjugated antibody (Beckman Coulter, Fullerton, CA, USA). Analysis of 10 000–30 000 cells per tube was performed on a FACScan flow cytometer (Becton Dickinson, Franklin Lakes, NJ, USA) and data were analyzed using CellQuest software (Becton Dickinson). Fetaltrol (Trillium Diagnostics, LLC, Brewer, ME, USA) was used as a tri-level fetal red cell control following the manufacturer's instructions. All of the F-cell determinations were carried out in a central laboratory at the time of the SIT Trial blood draw.

Genotyping

Samples were genotyped with the Illumina HumanHap650Y array (Illumina Inc., San Diego, CA, USA), which interrogates approximately 661 000 SNPs, of which 100 000 were selected as tags for populations with African ancestry.27 The Beadstudio software (Illumina Inc.) was used to cluster the data, and samples with <96.5% call rates were re-genotyped. A total of 24 International HapMap Consortium25 controls and 13 known duplicates were also genotyped. The reproducibility, calculated from duplicate pairs, was 99.98% and genotype concordance with HapMap data was 99.76%.

Quality control

Cryptic relatedness in the cohort was assessed by examining pairwise identity-by-state; 33 samples were identified as first-degree relatives (full or half siblings) and excluded from the study. An additional eight samples were excluded because of missing covariate data. Given the admixed nature of African Americans, we used principal component analysis as implemented in EIGENSTRAT28 both to identify genetic outliers (>6 s.d. on any of the top 10 principal components) and to correct for any potential residual population substructure. Six individuals were identified as genetic outliers and were dropped from the study, leaving 440 individuals for the subsequent analysis. The variance inflation factor for genomic control (λGC) was estimated as described by Devlin and Roeder29 to test for residual relatedness and/or population substructure.
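
As an illustration of the two checks described above, the following Python sketch flags samples lying more than 6 s.d. from the mean on any principal component, and estimates λGC as the ratio of the median observed 1-d.f. χ² statistic to its expected median under the null (about 0.455). It is a minimal sketch under stated assumptions, not the EIGENSTRAT or PLINK implementation: the function names and input layout are hypothetical, and outliers are flagged in a single pass, whereas EIGENSTRAT-style outlier removal may be applied iteratively.

```python
from statistics import NormalDist, mean, median, stdev

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def flag_pc_outliers(pcs, n_sd=6.0):
    """Return sorted indices of samples lying more than n_sd standard
    deviations from the mean on any principal component.

    `pcs` is a list with one entry per sample; each entry is that
    sample's list of PC scores (hypothetical input layout).
    """
    n_components = len(pcs[0])
    outliers = set()
    for k in range(n_components):
        scores = [sample[k] for sample in pcs]
        mu, sd = mean(scores), stdev(scores)
        if sd == 0:
            continue  # no spread on this component, nothing to flag
        for i, value in enumerate(scores):
            if abs(value - mu) > n_sd * sd:
                outliers.add(i)
    return sorted(outliers)


def genomic_control_lambda(p_values):
    """Estimate lambda_GC from two-sided association P-values.

    Each P-value (0 < p <= 1) is converted back to a 1-d.f. chi-square
    statistic; the median is divided by its expectation under the null.
    """
    norm = NormalDist()
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

A value of λGC close to 1 indicates little inflation of the test statistics; values well above 1 would point to residual relatedness or stratification.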

Statistical analysis

All association analyses and quality control measures were performed using the PLINK software package,30 version 1.06 (http://pngu.mgh.harvard.edu/purcell/plink/). The F-cell distribution in the studied samples was slightly skewed; therefore, a Box–Cox power transformation (λ=0.6) was applied to approximate the normal distribution.31 The effect of each SNP on the proportion of F-cells was assessed in a multivariate linear regression adjusted for age, sex and the top 10 principal components, assuming an additive genetic model of inheritance. The genome-wide significance threshold was determined by permutation (P-value <1.27 × 10⁻⁷). The R statistical computing environment (http://www.r-project.org/) (version 2.9.0) was used to generate quantile–quantile (Q–Q), Manhattan and regional association plots. To create a more comprehensive fine map of the SNPs from the observed genome-wide significant loci, imputation was performed using the hidden Markov model implemented in the MACH software (version 1.0.16) (http://www.sph.umich.edu/csg/abecasis/MACH/).32 The YRI and CEU combined panel from the 1000 Genomes Project33 was used as a reference population, and 100 iterations were used to estimate model parameters. To account for the uncertainty of imputed data, the estimated allele dosage was analyzed using ProABEL34 under a linear regression framework. Standard quality metrics were applied and only SNPs with a high imputation quality score (r²>0.8) were considered for the analysis. Further, the LD patterns within the surrounding region of the significant SNPs were constructed using the solid spine method,35 as implemented in Haploview36 (version 4.1) (http://www.broad.mit.edu/mpg/haploview/index.php).
Haplotype inference was carried out using a Bayesian statistical method implemented in the PHASE software (version 2.1) (http://www.stat.washington.edu/stephens/).37 Default settings of 100 iterations, 100 burn-in steps and a thinning interval of 1 were used to infer the most likely pair of haplotypes for each individual. Inferred haplotype diversity was represented by means of a cladogram constructed using Ward's hierarchical clustering method. Haplotype-based association analysis was performed using generalized linear regression, assuming an additive genetic model, and adjusted for age, sex and the first 10 principal components.
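
The phenotype transformation and the additive coding used in the regression can be sketched as follows. This is a simplified illustration, not the PLINK analysis itself: the Box–Cox transform is written out for a single positive value, and the additive model is reduced to an unadjusted least-squares slope of the phenotype on the allele count (0, 1 or 2), omitting the age, sex and principal component covariates. The function names and the example values are hypothetical.

```python
import math


def box_cox(x, lam):
    """Box-Cox power transform of a positive value x.

    Returns (x**lam - 1) / lam, or log(x) in the limiting case lam == 0.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def additive_slope(allele_counts, phenotypes):
    """Least-squares slope of phenotype on allele count (0, 1 or 2).

    Under an additive model each extra copy of the allele shifts the
    expected phenotype by this amount (no covariates in this sketch).
    """
    n = len(allele_counts)
    mean_g = sum(allele_counts) / n
    mean_y = sum(phenotypes) / n
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(allele_counts, phenotypes))
    sxx = sum((g - mean_g) ** 2 for g in allele_counts)
    return sxy / sxx


# Hypothetical F-cell percentages and genotypes, for illustration only.
f_cells = [12.0, 25.0, 40.0, 18.0, 33.0, 51.0]
genotypes = [0, 1, 2, 0, 1, 2]
transformed = [box_cox(v, 0.6) for v in f_cells]
beta = additive_slope(genotypes, transformed)
```

In this sketch the slope plays the role of the per-allele effect size β reported for each SNP, on the transformed scale.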

Results

Genome-wide single SNP association

Genome-wide association was performed for the proportion of F-cells in 440 individuals (232 men, 208 women) from the SIT Trial cohort. The average age of the cohort was 9.15 years, with 53% males. Detailed demographic and clinical characteristics for the study subjects are described in Table 1. A Q–Q plot of the observed P-values versus expected is shown in Figure 1a. The observed P-values show no early departure from the null, suggesting that our findings are unlikely to be influenced by poor genotyping, sample relatedness or population stratification. The genomic control inflation factor (λGC=1.001) from the analyzed 660 740 SNPs also suggests minimal population stratification. The distribution of association P-values (Manhattan plot) for F-cell levels is shown in Figure 1b. We observed a genome-wide significant finding at the previously implicated chromosome 2p15 region, confirming the association of the BCL11A locus with the modulation of F-cell levels. In total, four SNPs from this locus were genome-wide significant (permuted threshold <1.27 × 10⁻⁷), including rs766432 (P-value <3.32 × 10⁻¹³), which has previously been reported to be associated with HbF/F-cell levels in diverse populations.19, 22, 38, 39 We also identified an additional BCL11A intronic SNP (rs6706648) with a more significant association with F-cell levels (P-value <4.71 × 10⁻¹⁴) (Table 2). In our study, the ancestral alleles of both rs766432 and rs6706648 are associated with lower F-cell levels (rs766432, β=−1.49; rs6706648, β=−1.42) (Table 2). Distributions of the proportion of F-cells within each genotype group of these two SNPs are shown in Figure 2a. Individuals who were homozygous for the ancestral alleles at both rs6706648 (TT) and rs766432 (AA) had approximately two-fold lower F-cell levels than those who were homozygous for the derived alleles at both loci (Figure 2b).
In our samples, we did not observe any individual who was homozygous for the derived allele at rs766432 and homozygous for the ancestral allele (TT) at rs6706648. Our sentinel SNP, rs6706648, is located in intron 2 of BCL11A and is in low LD with rs766432 (r²=0.25) (Figure 3a). The two additional significantly associated SNPs from the same region, rs6709302 and rs10195871, were in moderate-to-high LD with rs6706648 (r²=0.52) and rs766432 (r²=0.85), respectively, and are therefore likely tagging the same signals (Supplementary Figure 1). To test for independent genetic effects of rs6706648 and rs766432 in the BCL11A region, we performed conditional multivariate regression analysis. The results of the conditional analysis are shown in Table 3, and indicate that both SNPs (rs6706648 and rs766432) have independent genetic effects. Individually, these two SNPs explain 12% (rs6706648) and 11% (rs766432) of the F-cell variance in patients with SCD, whereas together they explain 15%. Although no other loci were genome-wide significant, several loci showed suggestive association with F-cell levels (Supplementary Table 1).
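
For readers less familiar with the LD measure quoted above, the following sketch computes r² between two biallelic SNPs from a set of phased haplotypes. It assumes that phase is known and that alleles are coded 0 or 1; tools such as Haploview estimate haplotype frequencies from unphased genotypes, so this is an illustration of the quantity rather than of the software's procedure. The function name and input layout are hypothetical.

```python
def ld_r2(haplotypes):
    """Squared correlation (r^2) between two biallelic SNPs.

    `haplotypes` is a list of (a, b) pairs, one per phased chromosome,
    with the allele at each SNP coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom
```

An r² of 1 means that the two SNPs carry the same information, whereas a low value such as the 0.25 observed between rs6706648 and rs766432 leaves room for independent effects, which is what the conditional analysis tests.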

Table 1 Demographic and clinical characteristics
Figure 1

Summary of the genome-wide association results for the proportion of F-cells in the SIT Trial cohort. (a) Q–Q plot of the observed versus the expected P-values from an additive genetic model for the entire set of 660 740 SNPs (red), and after removing the genome-wide significant SNPs and all SNPs within ±100 kb of them (yellow). (b) Manhattan plot of the F-cell association results plotted against the position on each chromosome. The red peak on chromosome 2 corresponds to the BCL11A region (SNPs within ±100 kb of rs6706648) and the red horizontal line represents the permutation-based genome-wide significance threshold (P-value <1.27 × 10⁻⁷).

Table 2 Genome-wide significant SNPs associated with the proportion of F-cells in the SIT Trial cohort
Figure 2

Distribution of F-cells by observed genotypes. (a) Box plot demonstrating the distribution of untransformed F-cells within each genotype group of the rs6706648 and rs766432 SNPs. Each box represents the F-cell values between the 25th and 75th percentiles, and the dark black line within each box indicates the median value. (b) Mean of the untransformed F-cells in individuals by genotype combination. The number of individuals within each genotype combination is shown in parentheses.

Figure 3

Regional association plot of genotyped and imputed SNPs from the BCL11A intron 2 region. (a) Genotyped SNPs are plotted with their P-values (−log₁₀ P) as a function of genomic position (UCSC human genome hg18 coordinates). Estimated recombination rates observed in HapMap YRI (using a window of ±200 kb) are plotted to reflect the local LD structure around rs6706648. (b) Regional association plot showing the significance of the imputed BCL11A SNPs. The imputed SNPs are plotted with their P-values (−log₁₀ P) as a function of genomic position (UCSC human genome hg18 coordinates). Estimated recombination rates observed in HapMap YRI are plotted to reflect the local LD structure around the most significant SNP (rs7606173) observed in our study. The imputation of the plotted SNPs was performed using the YRI and CEU combined reference panel from the 1000 Genomes Project.33

Table 3 Association summary of genome-wide significant SNPs from conditional multivariate regression analysis

Imputation-based association testing

Using the YRI and CEU combined reference panel from the 1000 Genomes Project, imputation was performed over a 90-kb interval centered on rs6706648. In total, 102 SNPs were imputed from the BCL11A region, and after dropping low-quality imputed SNPs (r²<0.8), 42 SNPs were analyzed. Among these high-quality imputed SNPs, 17 common variants were significant at the genome-wide threshold (Supplementary Table 2). The strongest evidence for association was observed for rs7606173 (P-value <5.14 × 10⁻¹⁶), which is 3.4 kb downstream of rs6706648 and highly correlated with it (r²=0.91) (Figure 3b, Supplementary Figure 1). In our study, we replicate the association of all previously reported BCL11A SNPs,19, 20, 21, 22, 23 and no SNP other than rs7606173 was more significant than rs6706648 (Supplementary Table 2). The ancestral allele (G) of rs7606173 is the major allele (frequency: 0.55) and is associated with higher F-cell levels (β=1.50) (Supplementary Table 2). Among the 17 genome-wide significant SNPs, 10 were in high LD (r²>0.85) with rs766432 (Supplementary Table 2, Supplementary Figure 1). To identify independent genetic effects, conditional regression analysis was performed on imputed and genotyped SNPs. Using conditional regression, we demonstrate that rs7606173 and rs766432 account for the genetic effects on F-cell levels in the BCL11A region (Supplementary Table 3). The variance explained by rs7606173 is 13%, and together with rs766432 it explains 16% of F-cell variability.
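
The handling of imputed genotypes can be illustrated with a short sketch. The allele dosage is the expected allele count given the imputed genotype probabilities, and the quality filter keeps SNPs whose r² reaches 0.8. The quality measure used here, the ratio of the observed variance of the dosages to the variance expected under Hardy–Weinberg equilibrium, is a commonly used definition of imputation r² and is an assumption of this sketch; the exact computation in MACH may differ. All function names and example values are hypothetical.

```python
def expected_dosage(p_het, p_hom_alt):
    """Expected count of the alternate allele (between 0 and 2)."""
    return p_het + 2.0 * p_hom_alt


def dosage_r2(dosages):
    """Imputation quality as observed / expected dosage variance.

    Expected variance is 2p(1 - p), with p the allele frequency
    estimated from the mean dosage (assumed definition, see lead-in).
    """
    n = len(dosages)
    mean_d = sum(dosages) / n
    p = mean_d / 2.0
    var_expected = 2.0 * p * (1.0 - p)
    if var_expected == 0:
        return 0.0  # monomorphic in this sample, treat as uninformative
    var_observed = sum((d - mean_d) ** 2 for d in dosages) / n
    return var_observed / var_expected


def keep_high_quality(snp_dosages, threshold=0.8):
    """Return names of SNPs whose imputation r^2 is at least threshold."""
    return [name for name, dosages in snp_dosages.items()
            if dosage_r2(dosages) >= threshold]


# Hypothetical dosages for two imputed SNPs in four individuals.
example = {
    "snp_confident": [0.0, 1.0, 1.0, 2.0],   # genotypes imputed with certainty
    "snp_uncertain": [0.9, 1.0, 1.0, 1.1],   # dosages shrunk toward the mean
}
kept = keep_high_quality(example)
```

Poorly imputed SNPs have dosages that cluster near the population mean, which lowers the observed variance and therefore the quality score.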

Haplotype association analysis

To gain additional insight into the genetic association observed in the BCL11A intron 2 region, haplotype-based association analysis was performed. In total, seven haplotypes (frequency >1%) were inferred from the LD block containing rs766432 and rs7606173 (Figure 4). To generate haplotype clusters, the ancestral allele for each SNP was determined unambiguously by comparing sequence similarity with non-human primates. On this basis, all the inferred haplotypes were grouped into three clusters (I, II and III). Cluster I contains 42% of total haplotypes and has ancestral alleles at the majority of the loci, and was therefore used as the reference haplotype cluster. Using a linear regression framework, we observed that haplotypes from the reference cluster are associated with the lowest F-cell levels, whereas cluster III haplotypes are associated with the highest levels (Figure 4). As expected, cluster II haplotypes, which contain derived alleles at both the rs766432 and rs7606173 loci, showed an intermediate effect. Further, in reference to the ancestral cluster (cluster I), we observed an apparent additive effect for clusters II and III. The effect size estimate for cluster III haplotypes (β=1.98) is approximately two-fold higher than that observed for cluster II haplotypes (β=1.07) (Figure 4). The F-cell variance explained by cluster III haplotypes alone was 11%, and together with cluster II haplotypes it was 16%.

Figure 4

Genetic association of inferred haplotypes from the rs766432 and rs7606173 region LD block. In total, seven common haplotypes (frequency >1%) were inferred using 20 SNPs from the rs766432 and rs7606173 region LD block. The top panel provides the information of the ancestral alleles based on the similarity with non-human primates. The period symbol (‘.’) indicates the ancestral allele and letters in place of period symbol indicate derived alleles. The inferred haplotypes are grouped in three clusters (I, II and III) and cluster I is defined as the ‘reference haplotype cluster’.

Sex-stratified analysis

Given known sex-specific differences in HbF/F-cell levels,13 we also performed sex-stratified genome-wide analysis (Supplementary Tables 4 and 5). In men, we report an additional locus on chromosome 17p13.3, glucagon-like peptide-2 receptor (GLP2R), showing genome-wide significance for F-cell levels (rs12103880; P-value <3.41 × 10⁻⁸) (Supplementary Table 5). SNP rs12103880 is located in the 5′ UTR of GLP2R, and the ancestral allele (G) is associated with lower F-cell levels (β=−1.36) (Supplementary Table 5).

Association with previously reported loci

To date, several SNPs from three major QTLs (2p15, 6q23 and 11p16) have been reported in association with HbF/F-cells,19, 20, 21, 22 though many of these SNPs are in moderate-to-high LD and are likely tagging the same genetic signal at each locus. We estimated the correlations among reported SNPs based on ASW HapMap (phase II and III) data, and an association summary of these SNPs (r²<0.3) is shown in Supplementary Table 6. From the 2p15 region, in addition to the independent genome-wide significant SNPs rs7606173 and rs766432, we also observed an association with the same direction of effect for SNP rs6732518 (P-value <9.5 × 10⁻⁴) (Supplementary Table 6). Similarly, significant association from the 6q23 region was observed for rs9399137 (P-value <0.001) and rs4895441 (P-value <5.0 × 10⁻⁴), even after Bonferroni correction for the number of independent loci tested (α=0.05/13, P-value ≤0.0038), confirming the role of genetic variation in MYB-HBS1L in the regulation of F-cell levels.
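
The Bonferroni threshold quoted above follows directly from dividing the nominal α by the number of independent loci tested. The short sketch below reproduces the arithmetic and checks the P-value bounds reported in the text against it; the variable and function names are hypothetical, and the P-values are the reported upper bounds rather than exact values.

```python
ALPHA = 0.05
N_INDEPENDENT_LOCI = 13  # number of independent loci tested, as stated in the text

# Upper bounds on the P-values reported in the text for three candidate SNPs.
reported_p_upper = {
    "rs6732518": 9.5e-4,
    "rs9399137": 1.0e-3,
    "rs4895441": 5.0e-4,
}


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(ALPHA, N_INDEPENDENT_LOCI)

# A SNP passes if even the upper bound on its P-value is below the threshold.
significant = {snp: p < threshold for snp, p in reported_p_upper.items()}
```

Because each reported bound is below 0.05/13, all three SNPs remain significant after correction under this candidate-SNP approach.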

Discussion

In recent years, remarkable progress has been made through the use of genome-wide scans, demonstrating the feasibility of an unbiased approach to identify novel targets for therapeutic interventions.40, 41 Given these successes, several GWAS have been attempted to elucidate the genetic regulation of HbF and F-cell levels in diverse populations.19, 20, 22, 23 Here, we present results of a GWAS conducted on 440 individuals from the SIT Trial cohort. This study represents the largest genome-wide scan of the proportion of F-cell levels in patients with SCD of African ancestry.

Our results confirm the robust genetic association of BCL11A with the modulation of F-cell levels. A strong effect of this region was originally seen in non-anemic Europeans19 and subsequently replicated in other populations.20, 21, 22 Through this study, we not only confirm the previously associated BCL11A SNP (rs766432), but also report an independent effect of rs7606173 on F-cell level regulation (Supplementary Table 2). The observed effect of rs766432 is in the same direction as reported in previous GWAS and candidate gene studies performed in African Americans.22, 38 Recently, as part of a fine-mapping association study of HbF regulation in an African American cohort with SCD, Galarneau et al.23 reported the strongest genetic effect in BCL11A at intronic SNP rs4671393. Using stepwise conditional regression, they also reported two additional SNPs (rs7599488 and rs10189857) exhibiting independent genetic effects on HbF levels. In our study, we observed similar association effects for these SNPs, and rs4671393 is in perfect LD with rs766432 (r²=1) (Supplementary Tables 2 and 3, Supplementary Figure 1). All three SNPs reported by Galarneau et al.23 are in the same LD block as our two independent SNPs (rs766432 and rs7606173) (Figure 4). It is noteworthy that, after conditioning on the genetic effect of rs766432, we observed the enhanced genetic significance of rs7599488 and rs10189857 reported by Galarneau et al.,23 but this significance does not hold when the additional effect of rs7606173 is included (Supplementary Table 3). Our results suggest that rs7606173 captures the entire genetic effect of these reported SNPs, and that rs766432 and rs7606173 represent two independent effects. This locus demands further functional validation for a better understanding of the role of BCL11A in the regulation of HbF levels.

Given the previous evidence of sex-specific differences in HbF/F-cell levels,13 we performed sex-stratified analysis and report a novel locus in the chromosome 17p13.3 (GLP2R) region, associated with F-cell levels in men (rs12103880; P-value <3.41 × 10⁻⁸) (Supplementary Table 5). GLP2R encodes a G protein-coupled receptor that participates in cellular signaling through multiple G proteins to affect the cyclic adenosine monophosphate and mitogen-activated protein kinase pathways, leading to both proliferative and anti-apoptotic cellular responses. Considering the cell-proliferative and oncogenic functions of the two major previously reported QTLs (MYB and BCL11A), the genetic association of the GLP2R region seems to be of high relevance and needs to be replicated in larger samples.

Other than BCL11A, we did not observe any genome-wide significant association at previously reported QTLs. Though our study failed to identify genome-wide significant SNPs (permuted threshold <1.27 × 10⁻⁷) at chromosome 6q23 (MYB-HBS1L intergenic region), under a candidate SNP approach, signals from the rs9399137 (P-value <0.001) and rs4895441 (P-value <5.0 × 10⁻⁴) loci support their genetic involvement in F-cell regulation (Supplementary Table 6). Additionally, on chromosome X, we observed moderately strong association for the SNPs rs12559632 (P-value <2.64 × 10⁻⁶) and rs6630120 (P-value <1.48 × 10⁻⁵) from the PHEX (Xp22.2) and MAGEB18 (Xp21.3) genes, respectively (Supplementary Table 1). Our observation of the potential involvement of the PHEX region in F-cell modulation is in agreement with a recent GWAS performed in an African American cohort.22 Interestingly, both of these genes are close to the Xp22.2 locus, which was identified by a linkage study and associated with the regulation of HbF levels.18

This study differs from recent GWAS reports in patients with SCD in that F-cell number rather than total HbF was used exclusively as the HbF phenotype. The F-cell phenotype was used by Menzel et al.19 in the initial identification of BCL11A in a non-SCD population. Total HbF levels in patients with SCD are a reflection of three independently regulated factors: F-cell production rate, preferential survival of F-cells and the amount of HbF per F-cell.42 Results from GWAS using the total HbF phenotype might identify genes contributing to any of these processes, whereas our study identifies genes contributing to F-cell production or preferential F-cell survival, but not HbF per F-cell. We believe that focusing on F-cells is a rational approach. The high correlation between F-cell number and total HbF (r²>0.9)8, 9, 10 suggests that HbF variation is determined largely by F-cell number, not HbF per F-cell. Additionally, given that the goal of HbF-based therapy for SCD is to reduce polymerization of HbS in as many cells as possible, manipulation of F-cell number rather than the amount of HbF per F-cell is likely to have a much greater impact on the course of SCD. Patients with SCD already have relatively high HbF per F-cell (average of 38%),42 which should be sufficient to inhibit HbS polymerization in F-cells.

In summary, we report a novel independent genetic variant (rs7606173) associated with F-cell regulation and, in men, a novel locus at 17p13.3 (GLP2R) associated with F-cell levels. Additionally, we validate the genetic significance of the BCL11A region (2p15) and of MYB-HBS1L SNPs. This study highlights the importance of denser genetic screens of the BCL11A region in large and well-powered studies. Confirmation of these variants might help to improve the prediction of an individual's ability to produce HbF in response to disease, and will have implications for prenatal diagnosis and genetic counseling of patients with SCD.