Thalassemia is caused by a mutated or missing globin gene, resulting in decreased production of, or structural abnormalities in, one or more globin peptides. Such imbalances in globin chain production disrupt the proper assembly of hemoglobin tetramers, causing red blood cell fragility and hemolytic anemia.1 Thalassemia is distributed mainly in coastal areas of the Mediterranean, Africa, the Middle East, India, and southeastern Asia.2,3 Other studies have shown an unusually high prevalence of thalassemia among people in the southern provinces of China such as Guangxi, Guangdong, and Hainan.4,5,6 Our recent investigations added Yunnan province to the high-prevalence regions of thalassemia.7

Most individuals carry four α-globin genes and two β-globin genes. Two β-globins and two α-globins serve as a scaffold that holds four iron-containing heme molecules to form hemoglobin. α-thalassemia carriers with a deletion or dysfunction of one allele are also called “silent carriers” because their hematological profile is generally entirely normal. When deletion or dysfunction occurs in two alleles, either in trans or in cis, a mild asymptomatic microcytic anemia (α0-thalassemia) is present. Inactivation of three α-globin alleles, “hemoglobin H disease,” has variable presentations from mild to severe. A lack of all four α genes (α0-thalassemia homozygote) is fatal: children are born with lethal hydrops fetalis unless they receive a transfusion in utero.8 β-thalassemia carriers (also referred to as β-thalassemia minor), who have a deletion or dysfunction of one allele, have mild clinical symptoms. The lack of two β-globin alleles, β-thalassemia major, usually presents as severe anemia requiring lifelong transfusions; however, at times it has a variable, milder presentation.9 Therefore, effective premarital screening is important in regions of the world where α0-thalassemia carriers or β-thalassemia minor are prevalent.
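The relationship described above between the number of inactivated α-globin alleles and the clinical picture can be summarized as a simple lookup. The sketch below is illustrative only: the function name and the wording of the labels are our own shorthand for the categories in the text, not a diagnostic rule.

```python
# Illustrative lookup; labels paraphrase the categories described in the text.
ALPHA_PHENOTYPES = {
    0: "normal",
    1: "silent carrier",
    2: "mild asymptomatic microcytic anemia",
    3: "hemoglobin H disease",
    4: "hydrops fetalis",
}


def alpha_phenotype(inactivated_alleles: int) -> str:
    """Return the typical presentation for 0 to 4 inactivated alpha-globin alleles."""
    if inactivated_alleles not in ALPHA_PHENOTYPES:
        raise ValueError("expected 0 to 4 inactivated alpha-globin alleles")
    return ALPHA_PHENOTYPES[inactivated_alleles]
```

In practice the presentation also depends on the specific mutation (deletional versus nondeletional), so a count of alleles is only a first approximation.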

Yunnan province is located in southwestern China and links several neighboring southeastern Asian populations. Some ethnic minorities of Yunnan, like the Dai, represent a cross-border nationality. The Dai people have a high population density and frequent ethnic intermarriage. Genetic studies of the Yunnan Dai people, including investigations of thalassemia, are limited. Next-generation sequencing (NGS) generates vast amounts of genomic data that can reveal the genetic constitution of a population and help evaluate potential health risks. NGS has been widely used for noninvasive prenatal diagnosis and novel mutation detection in thalassemia.10,11 In this study, we used NGS in a large-scale population screening program to assess the thalassemia carrier frequency among the Dai people in Yunnan and to explore its potential use in preventing severe thalassemia and reducing pediatric mortality.

The Dai people are one of several ethnic groups residing primarily in Xishuangbanna Dai Autonomous Prefecture and Dehong Dai and Jingpo Autonomous Prefecture in Yunnan Province, southwestern China. They are closely related to the Lao and Thai people, who form a majority in Laos, Thailand, and other countries and regions in south and southeastern Asia. The Thai people, the largest ethnic group in Thailand, are also called Thai in Cambodia and Vietnam, Dai in China, Shan in Myanmar, Lao in Laos, and Assam in India, and are thought to share a recent common origin.12 People are classified as Dai in China if at least one of their parents belongs to the Dai ethnic group, and they typically speak one of the southwestern Thai languages.

Materials and Methods

Samples and demographic data

The study was approved by BGI-IRB (BGI’s institutional review board on bioethics and biosafety) and all individuals provided informed written consent. A total of 1,451 people who were premarital or newlywed ethnic minorities from Dehong (Ruili, Mangshi, Lianghe, Yinjiang, Longchuan) and Xishuangbanna (Menghai, Menghun, Mengzhe) prefectures, Yunnan Province, China, were involved in the screening. Among these samples, 951 premarital or newlywed individuals were of Dai ethnicity. Only individuals aged 18 to 45 years old were included. The age distribution was as follows: 18–25 years old, 583; 26–30 years old, 266; 31–35 years old, 79; 36–41 years old, 20; unrecorded age, 3 (Supplementary Table S1 online details the demographic data).

Traditional carrier screening using hematological phenotype analysis

All samples were screened using traditional hematological methods, which included a routine blood examination and hemoglobin electrophoresis for each sample. A hematological phenotype was identified if a positive result was obtained for at least one of the following: (i) RBC indexes of low cellular pigment, including mean corpuscular volume (MCV) ≤80 fl and/or mean corpuscular hemoglobin (MCH) ≤27 pg,13 and (ii) HbA2 ≤2.5% (abnormal hemoglobin concentration for a suspected α-thalassemia carrier) or HbA2 ≥3.5% (abnormal hemoglobin concentration for a suspected β-thalassemia carrier), associated with fetal hemoglobin (HbF) ≥2.0% in some cases.

Hematological tests were performed using the Automated Hematology Analyzer XS 500i (Sysmex, Kobe, Japan) for routine blood examinations and the V8 Capillary electrophoresis system (Helena Biosciences Europe, Tyne and Wear, UK) for the hemoglobin analysis. Sequential hematological screening was defined as positive if MCV or MCH was positive and HbA2 was then also positive. Parallel hematological screening was defined as positive if either the RBC indexes (MCV, MCH) or HbA2 were positive, that is, the combined set of MCV and MCH positive results and HbA2 positive results.
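The two combination rules can be made concrete with a minimal sketch. It uses the cutoffs stated above (MCV ≤80 fl, MCH ≤27 pg, HbA2 ≤2.5% or ≥3.5%) and assumes, following the “and/or” wording, that the RBC index counts as positive when either MCV or MCH is low. The class and function names are hypothetical and are not part of the study's software.

```python
from dataclasses import dataclass

# Cutoffs taken from the screening criteria described in the text.
MCV_CUTOFF_FL = 80.0
MCH_CUTOFF_PG = 27.0
HBA2_LOW_PCT = 2.5
HBA2_HIGH_PCT = 3.5


@dataclass
class Hematology:
    mcv_fl: float
    mch_pg: float
    hba2_pct: float


def rbc_positive(h: Hematology) -> bool:
    """Assumption: positive when MCV or MCH (or both) is at or below its cutoff."""
    return h.mcv_fl <= MCV_CUTOFF_FL or h.mch_pg <= MCH_CUTOFF_PG


def hba2_positive(h: Hematology) -> bool:
    """Positive when HbA2 is low (suspected alpha) or high (suspected beta)."""
    return h.hba2_pct <= HBA2_LOW_PCT or h.hba2_pct >= HBA2_HIGH_PCT


def sequential_positive(h: Hematology) -> bool:
    """Sequential screen: RBC indexes positive first, then HbA2 also positive."""
    return rbc_positive(h) and hba2_positive(h)


def parallel_positive(h: Hematology) -> bool:
    """Parallel screen: positive by RBC indexes or by HbA2."""
    return rbc_positive(h) or hba2_positive(h)
```

A sample with low MCV but HbA2 inside the 2.5–3.5% window is positive under the parallel rule and negative under the sequential rule, which is why the sequential strategy trades sensitivity for specificity.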

NGS screen using targeted capture

Preparation of DNA samples. Genomic DNA was extracted from 200-μl blood samples using the Kingfisher Flex (Thermo Scientific, Rockford, IL) and isolated using the GenMag Nucleic Acid Isolation kit (magnetic bead method) (GenMagBio, Beijing, China). DNA extracts were arrayed in 96-well plates and the concentration was quantified by Nanodrop-8000 (Thermo Scientific). We restricted our analysis to samples with a DNA concentration >20 ng/ml and an A260/A280 ratio between 1.8 and 2.0.

PCR amplification, pooling, library construction, and next-generation sequencing

We designed six pairs of primers for polymerase chain reaction (PCR) amplification, corresponding to four point-mutation targets (HBA1, HBA2, HBB-1, and HBB-2) and two deletion targets (HBA-Q and HBB-Q). The amplicons, in principle, should detect most known disease-causing point mutations and copy-number variations (CNVs) in the HbVar Database. The primers are related to the following patents: WO/2014/023076, WO/2014/023167, and CN102952877. PCR reactions were performed in 96-well plates, with each sample corresponding to one library. Ninety-six index sequences were designed, one for each well of the plate. The six primer pairs tagged with an index sequence are referred to as the index primers, and all samples were barcoded using these index primers. PCR reactions (25 μl) were performed with the index primers, 50–200 ng DNA, and 2× GoldStar Taq MasterMix (CoWin Bioscience, Beijing, China). Amplification was carried out on the ABI 9700 (PerkinElmer Applied Biosystems, Foster City, CA) and L69G (LongGene Scientific Instruments, Hangzhou, China) platforms. Thermal cycling conditions for the point-mutation amplicons were 95 °C for 10 min; 35 cycles of 95 °C for 30 s, annealing temperature for 30 s, and 72 °C for 50 s; 72 °C for 5 min; and a hold at 15 °C until the amplicons were pooled. Thermal cycling conditions for the CNV amplicons were 95 °C for 10 min; 24 cycles of 95 °C for 30 s and annealing temperature for 1 min; and a hold at 15 °C until the amplicons were pooled. The four point-mutation PCR amplicons were pooled in equal volumes into one centrifuge tube, and the two CNV amplicons were pooled into a second tube. We required ≥5 μg (pooled point mutation) and ≥1 μg (pooled CNV) of amplicons.

We adopted the Illumina HiSeq sequencing library preparation protocol for library construction, including DNA purification (Qiagen DNA Purification kit), DNA quantification (NanoDrop 8000 UV-Vis Spectrophotometer; Thermo Fisher Scientific), DNA fragmentation (excluding CNV amplicons), end blunting of fragments (Enzymatics kits), addition of a 3′-dA overhang, ligation of Illumina HiSeq paired-end adapters (Illumina), and DNA fragment separation (CNVs not included), followed by size selection using agarose gel electrophoresis and the StepOne Plus real-time PCR system. Sequencing was performed using the paired-end (PE100) protocol on an Illumina HiSeq 2000 machine. We generated a total of 1.5 Gbp per genomic library (Supplementary Figure S1 online).

All samples were tested and sequenced in batches by a second laboratory. Routine blood examinations were performed in five hospitals in Dehong and three hospitals in Xishuangbanna. The hemoglobin electrophoresis was also examined by two laboratories in Dehong and Xishuangbanna. All NGS sequencing was completed in four batches.

Data analysis and allele assignment

Based on the resequencing strategy, we developed a bioinformatics process focused on detecting Hb gene point mutations and deletions (Supplementary Figure S2 online). We excluded low-quality sequences from further analysis. Filtered sequence reads were partitioned by sample based on the respective adapter information (index primer). We processed single-nucleotide polymorphisms and InDels separately from CNVs, using different strategies. The point-mutation strategy was as follows: raw reads were aligned to the target region reference using the BWA program14 with default parameters, and the consensus sequence was generated by the SAMtools15 software package. Coverage, depth, and length were recorded for each consensus using ReSeqTools.16 Single-nucleotide polymorphism and InDel results were filtered based on sequencing quality and read depth. Mutation categories were assigned by aligning the filtered consensus sequence against the HbVar Database. For CNVs, after normalizing the target gene data against endogenous references, we estimated the relative ratio between the samples and normal controls using the read-depth statistics of the HBA1-Q, HBA2-Q, HBB-Q, and internal control genes. The variance and standard deviation of each cluster were obtained using a clustering method. Each value was then assigned to the cluster whose mean lay at the shortest distance from it, and this assignment was used to generate the absolute CNV for each sample.
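The read-depth approach to CNV calling described above can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: it replaces the data-driven cluster means with idealized centers (a relative ratio of k/2 for k copies, assuming two copies per gene in normal controls), and all function and parameter names are hypothetical.

```python
from statistics import mean


def normalized_depth(target_depth: float, control_depth: float) -> float:
    """Read depth of a target amplicon relative to an internal control gene."""
    if control_depth <= 0:
        raise ValueError("control depth must be positive")
    return target_depth / control_depth


def estimate_copy_number(target_depth, control_depth, normal_ratios,
                         normal_copies=2, max_copies=4):
    """Assign the copy number whose expected ratio is nearest to the observed one.

    normal_ratios: normalized depths measured in normal control samples.
    Assumption: cluster centers are idealized as copies / normal_copies rather
    than being learned by clustering, as the study describes.
    """
    baseline = mean(normal_ratios)
    relative = normalized_depth(target_depth, control_depth) / baseline
    return min(range(max_copies + 1),
               key=lambda copies: abs(relative - copies / normal_copies))


# Example: a sample whose target depth is about half that of normal controls.
normals = [0.95, 1.00, 1.05]
print(estimate_copy_number(480, 1000, normals))
```

With real data the cluster means drift away from the ideal k/2 values because of amplification bias, which is presumably why the study estimates them from the data and assigns each sample to the nearest observed mean.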

We validated our NGS approach for detection of thalassemia carriers in three ways: (i) we compared the NGS results of 51 random samples with their Sanger sequencing results; (ii) we confirmed nine NGS-positive samples by Sanger sequencing; and (iii) we compared the NGS and Sanger sequencing results of 23 samples that were positive for at least one hematological index but negative by NGS.

For the purpose of this study, the codon 26 mutation is included in the term β-thalassemia.


Results

NGS methodological validation

We initially performed a series of validation experiments by comparing results generated by NGS and Sanger sequencing independently as follows:

  1. Sanger sequencing results from 51 random samples matched the NGS results completely.

  2. Sanger sequencing confirmed nine samples detected as positive by NGS.

  3. Twenty-three cases were positive according to routine blood testing and hemoglobin electrophoresis screening but negative according to both NGS and Sanger sequencing.

Thalassemia carriers found by NGS

In total, 471 thalassemia mutation carriers were identified from 951 samples (Figure 1). The Dehong population had a higher rate of composite α-thalassemia and β-thalassemia carriers than the Xishuangbanna population (Supplementary Table S1 online).

Figure 1. Diagram of workflow and outcomes.

Among composite α-thalassemia and β-thalassemia carriers, 62.2% (51/82) carried a specific deletion (-α3.7/αα) in addition to an HBB gene point mutation. Composite carriers with another α-globin deletion or mutation (αCSα or --SEA/αα) and an HBB gene mutation were also common. Composite -α3.7/αα and codon 26/βA carriers were the most common and occurred mainly in the Dehong population. Several rare hemoglobin gene mutations, such as c.95+1G>A, c.1delA, and Hb Queens Park, were detected in this study although they have not previously been reported in mainland China (Supplementary Table S2 online).

We identified 21 distinct types of α-thalassemia mutations. Of the carriers, 44.5% (129/290) harbored a gene deletion, namely -α3.7/αα; --SEA/αα was less common. Although α-thalassemia carrier frequencies were similar in the Dehong and Xishuangbanna populations, the rank order of the two major mutant alleles differed. In the Dehong population, 16.3% (111/680) and 5.4% (37/680) of individuals were -α3.7/αα and --SEA/αα carriers, respectively, whereas in the Xishuangbanna population, 14.0% (38/271) and 6.6% (18/271) of individuals were --SEA/αα and -α3.7/αα carriers, respectively (Supplementary Table S3 online).

We identified 10 β-globin gene mutations in the screened cohort. Of the carriers, 63.6% (63/99) had the codon 26/βA genotype. The codon 17/βA and codons 41–42/βA genotypes were the second and third most frequent. The carrier rates of these three most abundant β-thalassemia genotypes (codon 26/βA, codon 17/βA, and codons 41–42/βA) were 8.4% (57/680), 0.6% (4/680), and 0.6% (4/680) in the Dehong population and 2.2% (6/271), 3.3% (9/271), and 2.6% (7/271) in the Xishuangbanna population, respectively (Supplementary Table S4 online).

α-globin and β-globin mutations and ranks of population carrier frequencies

There were 12 types of α-globin gene mutations detected in this study. Among them, the two most frequent were -α3.7 and --SEA, representing 80.0% (335/419) of all mutations. The αCSα, αWSα, and -α4.2 mutations were also frequent, together accounting for 15.0% (63/419) of all α-globin mutations (Table 1). The most common α-globin gene mutation in the Dehong population differed from that in the Xishuangbanna population, despite the fact that these Dai populations are thought to share a common recent origin. The carrier rate of the -α3.7 deletion was estimated to be as high as 23.0% (219/951) in this study and thus differed from all previously published reports.7,17,18

Table 1 Carrier rates of α-globin and β-globin gene mutations and constituent ratios in two populations

Eleven types of β-globin gene mutations were detected in this study. The three most common were codon 26, codons 41–42, and codon 17, which together represent 87.9% (167/190) of all β-globin mutations. Once again, the rank order differed significantly between the Dehong and Xishuangbanna populations (Table 1). Codon 26/βA predominated in Dehong, in contrast to Xishuangbanna. Mutations -50 G>A, -28 A>G, and Hb Dhonburi were also common. Three cases of Hb Dhonburi, which has been reported in populations in Italy, Iran, and Thailand, were detected for the first time in mainland China.19,20,21,22 One case of Hb Hope matched the previous records.7,17,18 Another two cases of c.316-238C>T were verified; previously, such cases had been reported only in India.23

Comparison between traditional hematological and NGS screening methods

Detection rate differences. Although RBC indexes identified 452 cases of low cellular pigment among the 951 samples (a positive rate of 47.5%), only 77 suspected α-thalassemia carriers remained after RBC indexes and hemoglobin electrophoresis results were combined. The detection rate using the traditional screening method was only 16.4% (61/372); 83.6% (311/372) of α-thalassemia carriers were missed using traditional approaches. In comparison, the β-thalassemia detection rate was 72.9% (132/181) based on RBC indexes combined with hemoglobin electrophoresis (Figure 2). By contrast, NGS indicated much higher carrier frequencies: an α-thalassemia carrier frequency of 39.1% (372/951) and a β-thalassemia carrier frequency of 19.0% (181/951). We estimated the false-negative rates of α-thalassemia detection to be 23.4% by RBC indexes (87/372, including 1 α0-thalassemia carrier) and 79.8% (297/372, including 72 α0-thalassemia carriers) by hemoglobin electrophoresis. The false-negative rates of β-thalassemia detection were 17.1% (31/181) by RBC indexes and 10.5% (19/181) by hemoglobin electrophoresis. The predominant false-negative genotypes by RBC indexes and hemoglobin electrophoresis are reported in the Supplementary Data online.
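The rates quoted above follow directly from the reported counts, with NGS-positive carriers as the denominator. A minimal sketch of the arithmetic is given below; the function names are ours, and the counts are those stated in the paragraph.

```python
def percent(count: int, total: int) -> float:
    """Express count/total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


def false_negative_rate(missed: int, carriers: int) -> float:
    """Share of NGS-confirmed carriers that a hematological test failed to flag."""
    return percent(missed, carriers)


def sensitivity(missed: int, carriers: int) -> float:
    """Complement of the false-negative rate."""
    return 100.0 - false_negative_rate(missed, carriers)


# Counts reported in the text: carriers confirmed by NGS and carriers missed.
ALPHA_CARRIERS, BETA_CARRIERS = 372, 181
missed_counts = {
    ("alpha", "RBC indexes"): (87, ALPHA_CARRIERS),
    ("alpha", "electrophoresis"): (297, ALPHA_CARRIERS),
    ("beta", "RBC indexes"): (31, BETA_CARRIERS),
    ("beta", "electrophoresis"): (19, BETA_CARRIERS),
}

for (thal_type, method), (missed, carriers) in missed_counts.items():
    fnr = false_negative_rate(missed, carriers)
    print(f"{thal_type}-thalassemia, {method}: {fnr:.1f}% missed")
```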

Figure 2. Methodological comparison of thalassemia mutations. (a and c) α-thalassemia and (b and d) β-thalassemia. M&M or H=MCV&MCH-positive or HbA2-positive; M&M&H=MCV&MCH-positive and HbA2-positive; SEQ=NGS-positive.

There were 99 carriers of --SEA (24 in combination with other mutations); 47 of them had false-negative HbA2 results (values between 2.5% and 3.5%), implying that hemoglobin electrophoresis was not sensitive enough for --SEA. None had false-negative results for RBC indexes.

Both RBC indexes and hemoglobin electrophoresis had a high missed-diagnosis ratio for thalassemia detection (Supplementary Figure S3 online). Moreover, the sequential MCV+MCH and HbA2 combined detection strategy resulted in low sensitivity and a high missed-diagnosis ratio for combined carriers of α- and β-thalassemia. For β-thalassemia, sensitivity improved with the parallel MCV+MCH and HbA2 combined detection screen when compared with routine blood detection alone. This was not true for α-thalassemia carriers, for whom the parallel MCV+MCH and HbA2 combined detection screen only moderately improved detection sensitivity, with a concomitant significant loss in specificity (Table 2, Supplementary Tables S5–S9 online, Figure 2).

Table 2 Sensitivity and specificity of hematological indexes for thalassemia carriers screened by NGS


Discussion

Carrier rate, mutation types, and rare mutations by NGS method

In the present study, using an NGS method, we found a thalassemia carrier rate of 49.5% among the Dai people, much higher than rates previously reported using hematological screening methods. This is also the first study to reveal a precise carrier rate of thalassemia in an adult cohort of the Dai people in China. Although this rate is remarkably high, it is consistent with some regional studies of thalassemia. A carrier rate of 43.17% was reported for the Dai people in Dehong (unpublished data). Another report of Dai children in Yunnan also revealed high carrier frequencies in Dehong and Xishuangbanna (unpublished data).

The α-thalassemia carrier rate was estimated to be as high as 30.5% (290/951), primarily due to the αCS and αWS mutations, which are rarely reported.24 A series of rare mutations, such as c.95+1G>A and c.1delA, were also detected for the first time in this study. The -α3.7 and --SEA mutations are the first and second most abundant α-thalassemia mutations in the Dai people, which matches previous reports.25 Across countries and regions with a high proportion of Thai people, including Yunnan, Myanmar, Thailand, and Vietnam, the α-thalassemia mutations share a common set of frequent genotypes. The most common α-thalassemia genotype is -α3.7/αα, followed by a set of less common α-thalassemia genotypes, including --SEA/αα, αCSα/αα, and -α3.7/-α3.7. In addition, we report that the Dai people in Yunnan have a high proportion of αWSα/αα carriers, a finding not seen in the records of any other country or region.

Among β-globin mutants, codon 26 showed the highest carrier rate, matching previous investigations26; it forms abnormal hemoglobin E (HbE). Codons 41–42 showed a higher carrier frequency than codon 17. The predominant β-thalassemia mutations in Yunnan are codon 26, codons 41–42, codon 17, and -50 G>A. Our previous study suggested that the HbE mutation may be relevant for human adaptation in Yunnan province.10 Based on our present findings and previously published data, all of the countries and regions in southeastern Asia show a high proportion of codon 26/βA and codon 26/codon 26. Codons 41–42 were the most common in five countries or regions: China (Hong Kong and Yunnan), Cambodia, Thailand, and Vietnam. The Dai people in Yunnan showed the highest proportion of -50 G>A mutations; this was not reported in other countries.

With respect to composite genotypes, codon 26/βA+-α3.7/αα and codon 26/βA+αCSα/αα are more prevalent than others. As expected, the Dai people in this study showed genotype distributions similar to those of the Thailand and Vietnam populations. The similarities in composite genotypes and mutational spectrums of the hemoglobin gene suggest that the Dai and Thai nationalities may share a common ancestry and a closer genetic kinship with the Kinh in Vietnam (Table 3). The Thai, Kinh, and Dai people share frequent commercial exchanges and intermarriage, reinforcing the genetic relationship among these three groups. This is in contrast to the Dai and native populations in Cambodia, Hong Kong, India, Laos, and Myanmar, where genetic exchange is rarer.27,28

Table 3 Most common allele frequencies or mutated genotypes of the countries and regions where Dai people occupy a high proportion of the total population

Methodology comparison

We demonstrate the superiority of NGS as a screening and confirmation method in high-prevalence populations. One possible explanation for the low sensitivity of traditional hematological screening is the relatively high prevalence of “silent” α-thalassemia. The silent genotypes (including -α3.7/αα, -α4.2/αα, αCSα/αα, and αWSα/αα) usually present normal MCV, MCH, and HbA2 values; 33.3% (43/129) of -α3.7/αα carriers (not combined with β-thalassemia) were missed using the routine hematological screening method. Similarly, 63.6% (7/11) of carriers of the -α4.2/αα genotype, 57.1% (8/14) of the αWSα/αα genotype, and 33.3% (5/15) of the αCSα/αα genotype were missed using routine hematological screening methods. In the present study, 91.5% (118/129) of -α3.7/αα carriers were missed by hemoglobin electrophoresis because their HbA2 was >2.5% (Table 2, Supplementary Tables S3, S6, and S9 online); 15.9% (10/63) of carriers with the codon 26/βA genotype were missed using the routine blood method. One of 24 carriers of the common β0 types (codons 41–42/βA and codon 17/βA) was missed using the routine hematological screening method (Supplementary Tables S4 and S5 online). Our statistical results support the observation that some silent thalassemia carriers are most likely not detected. The routine blood test was less likely to miss β-thalassemia carriers and had fewer limitations for β-thalassemia than for α-thalassemia.

The high proportion of hematological abnormalities suggests that the Dai population is more likely to exhibit iron deficiency. The significantly decreased MCV and MCH, even among α+-thalassemia carriers who are usually “silent,” may be compounded by iron deficiency in the Dai population.29 Further research will help define this; however, our findings provide a genetic basis for this difference.

We also found significant differences in both α-thalassemia and β-thalassemia carrier rates between the groups from Dehong and Xishuangbanna. The overall thalassemia carrier rate was estimated to be 49.9% (339/680) in the Dai people from Dehong. Their α-thalassemia carrier rate, including composite α-thalassemia and β-thalassemia genotypes, was 39.3% (267/680), with -α3.7/αα and --SEA/αα being among the most common. Their β-thalassemia carrier rate, including composite α-thalassemia and β-thalassemia carriers, was 20.3% (138/680), predominantly codon 26/βA. The overall thalassemia carrier rate was 48.7% (132/271) in the Dai people from Xishuangbanna. Their α-thalassemia carrier rate, including composite α-thalassemia and β-thalassemia carriers, was 38.7% (105/271), with --SEA/αα being the most common. Their β-thalassemia carrier rate, including composite α-thalassemia and β-thalassemia carriers, was 15.9% (43/271), predominantly codon 17/βA.
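Because composite carriers are counted in both the α- and β-thalassemia totals, the per-prefecture figures above can be cross-checked by inclusion-exclusion against the cohort-wide counts reported elsewhere in the paper (471 carriers overall, 82 composite carriers, 290 α-only carriers, and 99 β-only carriers). The sketch below performs that bookkeeping; the variable and function names are hypothetical, and the counts are those stated in the text.

```python
# Counts from the text: individuals screened, carriers of any thalassemia
# mutation, alpha carriers, and beta carriers (the last two include composites).
COUNTS = {
    "Dehong": {"screened": 680, "any": 339, "alpha": 267, "beta": 138},
    "Xishuangbanna": {"screened": 271, "any": 132, "alpha": 105, "beta": 43},
}


def carrier_rates(c: dict) -> dict:
    """Carrier rates in percent, rounded to one decimal place."""
    return {k: round(100.0 * c[k] / c["screened"], 1)
            for k in ("any", "alpha", "beta")}


def composite_carriers(c: dict) -> int:
    """Inclusion-exclusion: carriers counted in both the alpha and beta totals."""
    return c["alpha"] + c["beta"] - c["any"]


total_screened = sum(c["screened"] for c in COUNTS.values())
total_any = sum(c["any"] for c in COUNTS.values())
total_composite = sum(composite_carriers(c) for c in COUNTS.values())
alpha_only = sum(c["any"] - c["beta"] for c in COUNTS.values())
beta_only = sum(c["any"] - c["alpha"] for c in COUNTS.values())

for prefecture, c in COUNTS.items():
    print(prefecture, carrier_rates(c), "composite:", composite_carriers(c))
print("overall carrier rate:", round(100.0 * total_any / total_screened, 1))
```

The derived totals agree with the figures given in the Results, which suggests the per-prefecture and cohort-wide counts are internally consistent.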

The genetic heterogeneity of the Dai people in Xishuangbanna is greater than that of those in Dehong. This may relate to Xishuangbanna’s unique geographical position as a transportation hub between southeastern Asia and mainland China, resulting in more genetic exchange and greater diversity. Although the two prefectures have similar α-thalassemia and β-thalassemia carrier frequencies, the Dai people in Xishuangbanna are at greater risk of giving birth to children with severe thalassemia because of a more complex and varied set of α-globin and β-globin gene mutations. Codons 41–42 and codon 17 mutations, for example, are more severe (based on the degree of hemophthisis) than codon 26. Similarly, the deletion genotype --SEA/αα results in a serious clinical manifestation of anemia. We propose that requiring thalassemia carrier screening, particularly in premarital or newlywed Dai people in Xishuangbanna, may be crucial for preventing the birth of children with severe thalassemia.

Thirteen hemoglobin H (HbH) carriers (not including composite α- and β-thalassemia carriers) were found in our survey. HbH patients present with obvious microcytic hypochromic anemia, hepatosplenomegaly, mild jaundice, and other symptoms, although their clinical manifestations often vary greatly. The genotypes of the HbH patients reflected the main α-globin gene mutation types in the two prefectures: HbH carriers with --SEA/-α3.7 were frequently from Dehong, whereas those with --SEA/αWSα mainly originated from Xishuangbanna. Because our samples were taken from an adult cohort of reproductive age, the HbH patients detected in this research did not manifest serious clinical symptoms.

HbE is enriched in southeastern Asia, primarily in Laos, Thailand, Cambodia, and China. HbE carriers generally show no clinical manifestations and are difficult to diagnose, especially among Yunnan minority populations; we hypothesize that local intermarriage has increased the risk of disease in this area. HbE and β-thalassemia carriers have a complex and diverse set of phenotypes ranging from asymptomatic to requiring frequent clinical blood transfusions. No HbE/β0 carriers were detected in our study.

In summary, we report a high frequency of missed thalassemia carriers based on conventional hematological methods. Using an NGS approach, we analyzed more than 300 α-hemoglobin and β-hemoglobin mutations using a single test with a cost-effective price for each sample. Our approach significantly reduces false-negative results and misdiagnoses, and also reduces the need for repeated blood sampling and further referral tests. Our strategy may facilitate carrier screen programs in areas with a high prevalence of thalassemia. However, considering the complexity of α- and β-globin gene mutations in this population, there is always the possibility of misinterpretation of results, incorrectly assigning increased risk when there is none, and vice versa. Genotypic diagnoses must be interpreted by health-care workers and counselors properly trained in globin gene genetics and all its clinical manifestations.


The authors declare no conflict of interest.