Predictive SNPs for β0-thalassemia/HbE disease severity

β-Thalassemia/HbE disease has a wide spectrum of clinical phenotypes, ranging from asymptomatic to dependent on regular blood transfusions. The ability to predict disease severity is helpful for clinical management and treatment decision making. A thalassemia severity score has been developed from Mediterranean β-thalassemia patients. However, different ethnic groups may have different allele frequencies and linkage disequilibrium structures. Here, Thai β0-thalassemia/HbE disease genome-wide association study (GWAS) data of 487 patients were analyzed by a SNP interaction prioritization algorithm, interacting Loci (iLOCi), to find predictive SNPs for disease severity. Three SNPs from two SNP interaction pairs associated with disease severity were identified. The three-SNP disease severity risk score, composed of rs766432 in BCL11A, rs9399137 in HBS1L-MYB and rs72872548 in HBE1, showed more than 85% specificity and 75% accuracy. The three-SNP predictive score was then validated in two independent cohorts of Thai and Malaysian β0-thalassemia/HbE patients with comparable specificity and accuracy. The SNP risk score could be used for prediction of clinical severity in the Southeast Asian β0-thalassemia/HbE population.

disease modifiers, a single SNP might not be a good predictor of disease severity. In addition, genetic factors modify disease through a complex mechanism of multiple gene interactions within a biological network. Here the two β0-thalassemia/HbE GWAS datasets 3,4 of 487 cases were reanalyzed to search for the minimum set of predictive SNPs for disease severity by using a SNP interaction prioritization algorithm, interacting Loci (iLOCi) 5. A combination of three SNPs located on chromosome 2p16.1 (BCL11A), 6q23.3 (HBS1L-MYB) and 11p15.4 (HBE1) could predict disease severity with more than 85% specificity and 75% accuracy. The three predictive SNPs were validated in two independent cohorts of Thai and Malaysian patients with similar accuracy and specificity.

Identification of predictive SNPs.
As multiple SNPs have been demonstrated to be disease modifiers of β0-thalassemia/HbE, a single SNP might not have enough power to predict disease severity. Thus, SNP-SNP interactions in the two previously reported GWAS datasets 3,4 were reanalyzed with the iLOCi algorithm 5. The SNP pair most strongly associated with disease severity in the interaction analysis was rs766432/rs9399137, which predicted disease severity with 62.5% accuracy (Table 1). Combining the two top-ranked SNP pairs, rs766432/rs9399137 and rs72872548/rs9399137, increased prediction accuracy to 72.0%, while three SNP pairs (the prior two pairs plus rs11208724/rs1407273) gave 72.3% prediction accuracy (Table 1). Although prediction accuracy increased with the number of SNP pairs, the accuracy with three pairs was not much different from that with two pairs. Thus, only the three SNPs from the two top-ranked SNP pairs were selected for further evaluation: rs766432, located in the BCL11A gene on chromosome 2p16.1; rs9399137, located in the HBS1L-MYB intergenic region on chromosome 6q23.3; and rs72872548, located in the HBE1 gene on chromosome 11p15.4. In addition, single-SNP association with disease severity was reanalyzed in the two GWAS datasets by FaST-LMM, which uses a linear mixed model to select associated SNPs 6. The three SNPs, rs766432 (BCL11A), rs9399137 (HBS1L-MYB) and rs72872548 (HBE1), showed a high level of significance among the three previously reported regions that are strongly associated with disease severity (chromosome 2p16.1, chromosome 6q23.3 and chromosome 11p15.4) 3,4 (Table 2). The agreement between the two reanalysis methods suggested that the three SNPs were good candidates for constructing the disease severity predictive model. The allele frequencies of the three predictive SNPs in cohort 1 Thai β0-thalassemia/HbE patients, compared with another ethnic group, are shown in Table S3.
Determination of risk score for β-thalassemia/HbE disease severity predictive SNPs. β0-Thalassemia/HbE patients have a wide spectrum of clinical symptoms. To determine whether the selected SNPs also account for patients with moderate symptoms, the three predictive SNPs were genotyped in 181 patients with moderate symptoms as well as in the patients with mild and severe symptoms in discovery cohort 1. The genotype frequencies of the three predictive SNPs among the 668 β0-thalassemia/HbE patients in cohort 1 with different severity (mild, moderate and severe) are shown in Table 3. Odds ratio (OR) analysis was then performed to determine the risk genotypes of each SNP between two different disease severity groups (mild vs moderate, mild vs severe and moderate vs severe) in cohort 1 (Table 4). The rs766432 genotype AA has increased risk for severe symptoms (mild vs severe: OR = 0.39, 95% CI = 0.26-0.58, P = 4.98 × 10⁻⁶; moderate vs severe: OR = 0.51, 95% CI = 0.34-0.77, P = 1.45 × 10⁻³). The rs9399137 genotype TT has increased risk for severe symptoms (mild vs severe: OR = 0.35, 95% CI = 0.23-0.51, P = 2.16 × 10⁻⁷). The rs72872548 genotype CC has increased risk for severe symptoms.

Table 1. SNP pair interaction analysis of the GWAS data by the iLOCi algorithm, searching for predictive SNPs for disease severity of β0-thalassemia/HbE patients.

Table 3. Genotype frequencies of predictive SNPs among mild, moderate and severe β0-thalassemia/HbE patients in cohort 1.

To apply the risk genotypes to disease severity prediction, each SNP genotype was assigned a score of 0, 1 or 2 according to its risk of severe symptoms, from low to high. The predictive risk scores were therefore CC = 0, AC = 1 and AA = 2 for rs766432; CC = 0, TC = 1 and TT = 2 for rs9399137; and AA = 0, AC = 1 and CC = 2 for rs72872548 (Table 5). The scores of the three predictive SNPs were summed, and the combined score was used to interpret the disease severity prediction.
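The additive scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the genotype-to-score values are those stated in the text, while the function name `combined_risk_score`, the dictionary layout and the example patient are hypothetical. Cut-offs for interpreting the summed score (Table 5) are not reproduced in the text and are deliberately left out.

```python
# Genotype-to-score tables as stated in the text (0 = lowest risk, 2 = highest).
# Keys are alphabetically sorted allele pairs, so the text's "TC" is stored as "CT".
GENOTYPE_SCORES = {
    "rs766432": {"CC": 0, "AC": 1, "AA": 2},    # BCL11A
    "rs9399137": {"CC": 0, "CT": 1, "TT": 2},   # HBS1L-MYB
    "rs72872548": {"AA": 0, "AC": 1, "CC": 2},  # HBE1
}


def combined_risk_score(genotypes):
    """Sum the per-SNP scores for one patient (possible range 0 to 6).

    `genotypes` maps each SNP ID to a two-letter genotype string.
    This helper and its input format are illustrative assumptions.
    """
    total = 0
    for snp, table in GENOTYPE_SCORES.items():
        # Sort the two alleles so that "TC" and "CT" map to the same key.
        key = "".join(sorted(genotypes[snp].upper()))
        total += table[key]
    return total


# Hypothetical patient: AA (2) + TC (1) + AC (1)
patient = {"rs766432": "AA", "rs9399137": "TC", "rs72872548": "AC"}
print(combined_risk_score(patient))
```

A patient carrying the highest-risk genotype at all three SNPs would receive the maximum combined score of 6, and one carrying the lowest-risk genotype at all three would receive 0.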
Three scoring models for interpreting the combined SNP risk score were generated (Table 5) and evaluated in the 668 β0-thalassemia/HbE patients of cohort 1. The first scoring model, model 1, yielded the highest sensitivity (60.0%) and accuracy (71.2%), while model 3 yielded the highest specificity (81.4%) (Table 6). To determine which scoring model is better at predicting disease severity, the three scoring models were further validated in two independent Thai and Malaysian β0-thalassemia/HbE cohorts.
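For readers who want to recompute the three performance measures, the following Python sketch shows the standard definitions of sensitivity, specificity and accuracy. It assumes a binary view in which one severity class (here "severe", an assumption, since the text does not state which class was treated as positive) is compared against all others. The function name and the example labels are hypothetical.

```python
def classification_metrics(actual, predicted, positive="severe"):
    """Return sensitivity, specificity and accuracy for one positive class.

    `actual` and `predicted` are equal-length sequences of class labels.
    Treating "severe" as the positive class is an illustrative assumption.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if truth == positive:
            if guess == positive:
                tp += 1  # positive case correctly predicted
            else:
                fn += 1  # positive case missed
        else:
            if guess == positive:
                fp += 1  # negative case wrongly called positive
            else:
                tn += 1  # negative case correctly predicted

    def ratio(numerator, denominator):
        # Avoid division by zero when a class is absent from the data.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical labels for five patients
actual = ["severe", "severe", "mild", "mild", "severe"]
predicted = ["severe", "mild", "mild", "severe", "severe"]
print(classification_metrics(actual, predicted))
```

High specificity means that few non-severe patients are predicted as severe, which is the property the scoring models are compared on alongside sensitivity and accuracy.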

Validation of predictive SNP scoring system. The three-SNP risk scoring models were validated in 122 β0-thalassemia/HbE patients in two cohorts: 64 Thai patients (cohort 2) and 58 Malaysian patients (cohort 3). The allele frequencies of the three predictive SNPs in the Thai and Malaysian validation cohorts are shown in Table S3. Consistent with the results from cohort 1, model 1 had the highest sensitivity and accuracy compared with model 2 and model 3 in every cohort, while model 3 had the highest specificity in every cohort (Table 6).

Discussion
β0-Thalassemia/HbE patients have a wide range of disease severity, from not requiring regular blood transfusion to transfusion dependence. Prediction of disease severity is helpful for clinical management and for the quality of life of the patients. Knowing the likely severity of the affected child during pregnancy could aid genetic counseling. After birth, it could also be useful for clinical decisions about the transfusion program, such as the frequency of transfusion and the age at which it is needed. In addition, it will aid in deciding whether to perform hematopoietic stem cell transplantation, which is more efficient if performed early in life. However, the clinical phenotype can take a few years to stabilize, and in such a situation the possibility of anticipating the clinical severity could be essential. Two β0-thalassemia/HbE GWAS studies have identified a number of risk SNPs. Here, FaST-LMM, iLOCi and the WEKA machine learning software were used to examine SNP interactions and determine a predictive SNP risk score.
This study explored the predictive SNPs for disease severity in 668 β0-thalassemia/HbE patients. Analysis of SNP interaction revealed that rs766432 (BCL11A), rs9399137 (HBS1L-MYB) and rs72872548 (HBE1) are most strongly associated with clinical symptoms. Three risk scoring models, which differ in assigning score 5 to moderate and/or severe symptoms, were evaluated for prediction of clinical severity in three independent cohorts. This is because the rs72872548 genotype AC has a high frequency among mild, moderate and severe cases (78.33, 77.35 and 63.84%, respectively). Score model 1 has the highest sensitivity and accuracy. However, it would be difficult to predict whether patients with risk score 5 would be moderate or severe. Score model 3 has the highest specificity, 81.4-88.8%, in every cohort. Although assigning score 5 to the moderate group might cause some severe patients to be mispredicted as moderate, this is preferable to predicting moderate patients as severe, in which case the patients might undergo unnecessarily aggressive treatments.

Table 5. Predictive SNP scoring system for β0-thalassemia/HbE disease severity.

Table 6. Sensitivity, specificity and accuracy of the predictive SNP scoring system assessed in the cohort patients.

The three predictive SNPs identified here are located in three loci that are associated with HbF levels, which affect the central mechanism underlying disease pathophysiology: the degree of excess of α-globin chains and globin chain imbalance. Several GWAS studies showed that the three loci are associated with HbF level 3,7,8. BCL11A is a transcription factor that acts as the specific erythroid repressor of HbF expression in the developmental silencing of the mouse and human HBG genes 9,10. The HBS1L-MYB intergenic region contains distal enhancers required for MYB activation 11. c-MYB, a transcription factor, is a key regulator of hematopoiesis and erythropoiesis. c-MYB represses HBG gene expression via KLF1 activation of BCL11A 12. In addition, several GWAS studies showed association of SNPs in BCL11A and HBS1L-MYB with red blood cell parameters such as mean corpuscular volume (MCV) 13,14. The influence of Xmn1-HBG2 (rs7482144) in the β-globin cluster on HbF level was discovered long ago through candidate gene and genetic linkage studies and later confirmed by GWAS studies. The rs72872548 (HBE1) SNP is located in the same haplotype block as Xmn1-HBG2.

The use of multiple genetic factors to predict β-thalassemia severity was first reported in a study of Sardinian homozygous β0-thalassemia patients. Three genetic factors, α-thalassemia, rs11886868 in BCL11A and rs9389268 in HBS1L-MYB, accounted for 75% of the phenotype severity 15. A study of a mix of Mediterranean (3/4) and Asian (1/4) patients from France used five genetic factors, β-thalassemia mutations, α-thalassemia, the Xmn1-HBG2, rs11886868 in BCL11A and rs9399137 in HBS1L-MYB, and achieved 83.2% predictive accuracy 16. A thalassemia severity score (TSS) was developed from 890 homozygous β-thalassemia patients of the Mediterranean basin using several genetic modifiers together with sex: β-thalassemia mutations, α-thalassemia, the Xmn1-HBG2, rs1427407 and rs1018957 in BCL11A and rs9399137 in HBS1L-MYB 17. In this study, β0-thalassemia/HbE patients who carry β+-thalassemia mutations or were found positive for the common Thai α-thalassemia mutations were excluded. Hence, the predictive score comprises only three SNPs, which are located at the same loci as in the previous studies.
A limitation of the predictive SNP scoring is that different ethnic groups may have different allele frequencies and linkage disequilibrium structures. This might render the scoring less accurate in untested populations. The TSS was validated in a North African cohort, which showed that the allelic frequencies of the SNPs differ from those in the Mediterranean population 18. According to the International HapMap Project, the A/G allele frequencies of rs1018957 (BCL11A) are quite different among European (0.580/0.420), African (0.722/0.278) and the Thai β0-thalassemia/HbE patients of this study (0.182/0.818). The allele frequencies of the predictive SNPs in the Thai and Malaysian populations of this study are comparable. This suggests that the three SNPs can be used as predictive SNPs for disease severity at least in Southeast Asia, where β0-thalassemia/HbE is prevalent.
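Allele frequencies such as those compared above are derived from genotype counts by simple allele counting. The Python sketch below illustrates the calculation. The function name is hypothetical and the example counts are invented for illustration only; they are not data from this study.

```python
def allele_frequencies(n_hom_a, n_het, n_hom_b):
    """Return (freq_a, freq_b) for a biallelic SNP from genotype counts.

    n_hom_a: number of individuals homozygous for allele A
    n_het:   number of heterozygous individuals
    n_hom_b: number of individuals homozygous for allele B
    """
    total_alleles = 2 * (n_hom_a + n_het + n_hom_b)
    if total_alleles == 0:
        raise ValueError("at least one genotyped individual is required")
    # Each homozygote carries two copies of the allele, each heterozygote one.
    freq_a = (2 * n_hom_a + n_het) / total_alleles
    return freq_a, 1 - freq_a


# Invented counts, for illustration only: 10 AA, 40 AG, 50 GG
freq_a, freq_g = allele_frequencies(10, 40, 50)
print(round(freq_a, 3), round(freq_g, 3))
```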
This study, to the best of our knowledge, presents the first SNP risk score for prediction of clinical severity developed for a Southeast Asian β0-thalassemia/HbE population. This may help inform prognosis and guide therapy. However, the predictive SNPs of this study might not predict disease severity in β+(severe)/β0-thalassemia and β0/β0-thalassemia patients, who have a higher degree of α- to non-α-globin imbalance, which causes the severe thalassemia phenotype. In addition, validation of the predictive SNPs is required in other populations.

Methods
Subjects and thalassemia diagnosis. This study was performed in accordance with the Helsinki Declaration. Three cohorts of β0-thalassemia/HbE patients were enrolled: a Thai discovery cohort (cohort 1), a Thai validation cohort (cohort 2) and a Malaysian validation cohort (cohort 3). In this study, β0-thalassemia/HbE patients who carry β+-thalassemia mutations or were found positive for the common Thai α-thalassemia mutations were excluded. Cohort 1 comprised 668 cases (180 mild, 181 moderate and 307 severe) of Thai β0-thalassemia/HbE patients. A total of 487 cases (180 mild and 307 severe) were randomly selected from the two GWAS analyses of mild and severe cases 3,4. In addition, 181 moderate patients were newly added to cohort 1. The two independent validation cohorts comprised 64 Thai cases and 58 Malaysian cases, respectively.
After informed consent was obtained, EDTA blood was collected and hematological data analysis was performed using an automated cell counter (ADVIA 120, Bayer, Tarrytown, NY). Hemoglobin analysis and quantification were performed by automated cation exchange high-performance liquid chromatography (Bio-Rad Variant II, Bio-Rad, Hercules, CA). The β-thalassemia mutations were characterized using reverse dot blot hybridization 19. The α-thalassemia deletional mutations were genotyped by multiplex GAP-PCR 20. The α-thalassemia point mutations, Hb Constant Spring and Hb Pakse, were determined by dot blot hybridization 21. The disease severity of β0-thalassemia/HbE (mild, moderate or severe symptoms) was classified by a scoring system based on six parameters: hemoglobin at steady state, age at receiving first blood transfusion, requirement for blood transfusion, size of spleen, age at thalassemia presentation, and growth and development 22.
Predictive SNP discovery. The disease severity association analysis was performed on the two separate GWAS datasets 3,4 of 180 mild and 307 severe cases of cohort 1 by two techniques: iLOCi 5 for SNP interaction and FaST-LMM 6 for single SNP association. The predictive model was later constructed using a standard machine learning software package, the Waikato Environment for Knowledge Analysis or WEKA (http://www.cs.waikato.ac.nz/ml/weka/). The classification accuracy of disease severity was measured with the Hidden Naive Bayes model included in WEKA.
Statistical analysis. All statistical procedures were performed using SPSS v.18.0 software package (SPSS Inc., Chicago, IL). Odds ratio (OR) analysis was used to identify risk genotypes of the predictive SNPs associated with disease severity.
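As an illustration of the odds ratio analysis, the Python sketch below computes an OR and its 95% confidence interval from a 2 × 2 table using the standard log (Woolf) method. The study used SPSS, and the text does not state which interval method was applied, so the choice of the Woolf interval is an assumption. The function name and the example counts are hypothetical and are not data from this study.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and confidence interval from a 2 x 2 table (Woolf method).

    a: group 1 patients with the genotype of interest
    b: group 1 patients with any other genotype
    c: group 2 patients with the genotype of interest
    d: group 2 patients with any other genotype
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts, for illustration only
or_value, ci_low, ci_high = odds_ratio_ci(20, 80, 40, 60)
print(f"OR = {or_value:.3f}, 95% CI = {ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1 indicates that the genotype distribution differs between the two severity groups at roughly the 5% significance level.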