Molecular characterisation of phenylketonuria in a Chinese mainland population using next-generation sequencing

Phenylketonuria (PKU) is an inherited autosomal recessive disorder of phenylalanine metabolism, mainly caused by a deficiency of phenylalanine hydroxylase (PAH). The incidence of various PAH mutations differs among races and ethnic groups. Here we report a spectrum of PAH mutations compiled from 796 PKU patients from mainland China. All 13 exons and adjacent intronic regions of the PAH gene were sequenced by next-generation sequencing (NGS). We identified 194 different mutations, of which 41 had not been reported previously. Several mutations recurred with high frequency, including p.R243Q, p.EX6-96A > G, p.V399V, p.R241C, p.R111*, p.Y356*, p.R413P, and IVS4-1G > A. In total, 76.33% of mutations were localized in exons 3, 6, 7, 11 and 12. We further compared the frequency of each mutation between populations in northern and southern China, and found significant differences in 19 mutations. Furthermore, we identified 101 mutations that had not previously been reported in the Chinese population; our study thus broadens the mutational spectrum of Chinese PKU patients. Additionally, the 41 novel mutations will expand and improve the PAH mutation database. Finally, our study offers evidence that NGS is effective, reduces screening times and costs, and facilitates the provision of appropriate genetic counseling for PKU patients.

families as well as prenatal diagnosis, and to refine diagnoses in and anticipate the dietary requirements of affected patients [33][34][35] .
A large-scale, unbiased, comprehensive survey of PAH mutations in the Chinese population has not been available. Most previous analyses of the Chinese population were limited to a few common mutations, or were confined to certain regions of PAH, resulting in selection bias 3,12,[36][37][38][39][40][41] . One report surveyed the entire PAH gene, but only in a small cohort of 212 patients from mainland China 12 .
In the present study, we report a spectrum of PAH mutations compiled from a large cohort of 796 PKU patients in mainland China. We determined the sequence of the entire PAH gene using next-generation sequencing (NGS). Among the 194 mutations identified, 41 had not been reported in the literature, and 101 had not been reported in the Chinese population. We believe that these results will facilitate the development of appropriate genetic counseling for PKU patients in China.

Results
Mutation spectrum. In the included cohort of Chinese patients, potential disease-causing mutations were identified on 1516 of 1592 independent alleles, corresponding to a mutation detection rate of 95.23%. A total of 720 patients were completely genotyped, whereas in the remaining 76 individuals only one causative mutation was identified; the second mutation could not be identified on this platform. Among the fully genotyped patients, two mutations were detected in 683 patients, who had either compound heterozygous (n = 622) or homozygous (n = 61) genotypes; three mutations were found in 35 patients, and four mutations in 2 patients. The results of the gene analysis are summarised in Table 1. A total of 194 different mutations were identified, including 134 missense mutations (69.07%), 25 splice-site mutations (12.89%), 18 deletions (9.28%), 14 nonsense mutations (7.22%), 2 insertions (1.03%), and 1 silent/splice mutation (0.52%). Most mutations were localized in five exons, which together accounted for 76.33% of the total mutations: exon 7 (33.44%), exon 11 (13.18%), exon 6 (10.48%), exon 12 (10.29%) and exon 3 (8.94%). Interestingly, no mutations were identified in exon 13.
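The summary figures in this paragraph follow directly from the reported counts. The short Python sketch below recomputes them as a consistency check; the variable and function names are illustrative and are not part of the study's analysis pipeline.

```python
# Counts taken from the text; all names here are illustrative only.
PATIENTS = 796
ALLELES_WITH_MUTATION = 1516

# Number of distinct mutations per mutation type.
MUTATION_TYPES = {
    "missense": 134,
    "splice-site": 25,
    "deletion": 18,
    "nonsense": 14,
    "insertion": 2,
    "silent/splice": 1,
}

# Reported share of mutations per exon, in percent.
EXON_SHARE = {7: 33.44, 11: 13.18, 6: 10.48, 12: 10.29, 3: 8.94}


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


total_alleles = 2 * PATIENTS  # two PAH alleles per patient
detection_rate = percent(ALLELES_WITH_MUTATION, total_alleles)
distinct_mutations = sum(MUTATION_TYPES.values())
type_share = {name: percent(count, distinct_mutations)
              for name, count in MUTATION_TYPES.items()}
top_exon_share = round(sum(EXON_SHARE.values()), 2)

print(total_alleles, detection_rate, distinct_mutations, top_exon_share)
```

The per-type percentages are computed relative to the 194 distinct mutations, which is the denominator that reproduces the values quoted in the text.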

Comparison between northern and southern Chinese populations. The frequency of each PAH mutation was compared between geographical regions in China. Sixteen mutations with a frequency of more than 1% were used for this comparison. Six mutations showed obvious regional clustering: p.V399V, p.R413P and IVS7 + 2T > A were clustered in northern Chinese populations, and p.R241C, p.R408Q and p.Y166* were clustered in southern Chinese populations. Among the remaining 178 mutations, each with a relative frequency of <1%, thirteen showed a significantly different distribution between northern and southern China.
Among the 41 novel mutations, the majority were missense mutations (n = 25), followed by small deletions (n = 8), splice mutations (n = 5), nonsense mutations (n = 2) and insertions (n = 1). Thirty-four of the novel mutations were detected in coding regions, and the remaining seven were located in introns.
The prediction results for the novel mutations are listed in Table 2. Predictions were obtained for 25 of the novel mutations; of these, 20 were predicted to be probably damaging and 5 to be tolerated. The remaining 16 mutations could not be assessed using this tool. The frequencies of the novel mutations were relatively low, which indicates that they are rare mutations.

Discussion
A comprehensive survey of the mutation spectrum of a protein of interest in a given population can not only provide insight into the structural and functional aspects of the protein and into genotype-phenotype correlations 14,15,20,22,23,26,28,42 , but also facilitate genetic counseling for patients' families. In this study, we described the molecular basis of PKU in a mainland Chinese population by analysing mutations in the PAH gene using NGS. In a cohort of 796 patients, mutations were detected on 1516 of the 1592 independent alleles, representing a mutation detection rate of 95.23% (Table 1). A total of 194 distinct mutations were found, demonstrating the high genetic heterogeneity that is inherent in PKU.
The number of different mutations in a given population is usually high, and typically comprises a few prevalent mutations and a large number of private mutations 43 . In comparing our study with previous reports, 93 of the mutations that we identified have been previously reported; the remaining 101 mutations (including 41 novel mutations) were reported for the first time in a Chinese mainland population 12,[36][37][38][39]41 . As shown in Table 3, eight mutations, namely p.R243Q, p.EX6-96A > G, p.V399V, p.R241C, p.R111*, p.Y356*, p.R413P and IVS4-1G > A, are common mutations in the mainland Chinese population, although the rank order of these mutations differed. Among them, p.R241C, p.EX6-96A > G, p.V399V, p.R413P, p.R243Q and p.R111* are also considered to be prevalent in the Chinese Taiwanese population 40 . The epidemiology of phenylketonuria in China is complicated. The prevalence of PKU in northern China (1/11,000) is close to what has been documented in Caucasian populations, but the prevalence in southern China is much lower 44 . A comparison between northern and southern China indicated marked differences in the relative frequencies of mutations. Among the eight common mutations, the frequencies of p.V399V, p.R241C and p.R413P differed significantly between the two regions, with p values of 0.037, 0.007 and 0.002, respectively. The finding that p.R413P is clustered in northern China is consistent with the report by Gu et al. 12 . Furthermore, p.V399V has previously been detected primarily in populations of Xinjiang and northern China 36,37 , whereas p.R241C has been detected primarily in populations in southern China and in Taiwanese patients 3,45 . The majority of the population in Taiwan is descended from southeast China. Based on our data, we hypothesize that the non-uniform distribution of p.V399V, p.R241C and p.R413P is a result of migration and the founder effect.
It is well known that different ethnic groups have their own distinctive and diverse PAH mutant allele series that include either one or a few prevalent founder alleles 46 . When PAH mutational data were compared between different ethnic groups, correlations were found between the mutations in the investigated populations and their genetic histories. Marked differences were identified when PAH mutations with ≥ 3% frequency (totaling 34) were compared between Asian and European countries (Table 4). Five mutations, p.R243Q, p.EX6-96A > G, p.R241C, p.R413P and IVS4-1G > A, were found to be common in East Asian countries such as China, Japan 14 and Korea 8 , accounting for 53.76%, 69.70%, and 62.10% of the total mutations, respectively. Three mutations, p.R111*, p.Y356* and p.T278I, were frequently detected in China and Japan 14 , in China and Korea 8 , and in Japan 14 and Korea 8 , respectively. The remaining mutations, including p.V399V, p.A259T, p.R252W, p.Y325* and p.V388M, were found to be common in only one country. In sharp contrast, with the exception of p.R252W, the above-mentioned mutations were either rarely detected or undetected in West Asian and European countries. For example, in Iran 17 and Turkey 18 , three common mutations, p.R261Q, p.P281L and IVS10-11G > A, were shared; however, these mutations were either rare or absent in East Asia. Instead, they were prevalent in select European countries. In Europe, p.R408W was found to be the common mutation, ranking first in the Czech Republic 28 (East Europe) and Germany 5 (West Europe), and second in Denmark 4 (North Europe) and Serbia 32 (South Europe). These results suggest that p.R408W is the most prevalent founder allele in the European population 46 . In contrast, p.R408W was either rarely detected or undetected in populations of East Asia, whereas it was the most common mutation in Turkish populations 18 .
The remaining mutations were found to be prevalent only in subsets of the four countries. For example, p.L48S was the most prevalent mutation in Serbian populations 32 , and it was also common in Turkish populations 18 . The p.R158Q, p.A403V, p.Y414C, and IVS12 + 1G > A mutations were relatively common in only two countries. Based on the above comparison, we found that the mutant allele distributions of West Asia and Europe overlap in several respects, and that the mutations common in East Asia are different from these.
In the present study, the high mutation detection rate of 95.23% was similar to that of previous studies in which sequencing analysis of the PAH gene was conducted 12,15 , but it was somewhat lower than the rates from studies that employed exon analysis combined with multiplex ligation-dependent probe amplification (MLPA) 14,20,24,28 . This is because NGS is able to detect small deletions and insertions, but not large deletions or duplications. Despite scanning the entirety of the PAH coding region and its exon-intron boundaries, no mutations were detected on 76 alleles. The most likely explanation is that the mutations are located in regions that were not covered in this study (for example, in the promoter regions, the 5′ and 3′ UTRs, non-coding RNA binding sites, or intronic sequences far from the exon-intron boundaries). Alternatively, the mutations may have been large deletions or duplications.
Using NGS as a routine genetic diagnostic tool enables thousands of DNA sequences to be obtained simultaneously, with notably reduced turnaround times and at a significantly reduced cost. Furthermore, this technique provides high sensitivity, specificity and coverage (including all coding regions of the involved exons and adjacent intronic regions). However, the biggest limitation of NGS is the need to analyse and interpret complicated data. We believe our study will provide guidance for future medical practice such as prenatal screening and early diagnosis of PKU. Diagnostic methods can be developed based on the known characteristics of a population. Currently, PKU screening is performed in newborn babies as part of tertiary prevention in the birth-defect prevention network in China. Developing a new screening method might enhance primary prevention. Based on the mutational spectrum presented in this study, our hope is that carrier screening can be conducted before pregnancy, which would offer timely guidance on prenatal diagnosis for couples who are both carriers.

Subjects.
A total of 796 unrelated patients from 29 separate newborn screening centres in China were enrolled. These patients were diagnosed either at birth through a neonatal screening program or later on the basis of clinical presentation. Demographic data, including age, consanguinity, family history, and geographical origin, and biochemical testing data, including plasma phenylalanine (Phe) levels, dihydropteridine reductase activity, urinary biopterin and neopterin ratio, and tetrahydrobiopterin loading, were collected. The ages of the included patients ranged from 6 months to 5 years. In families with more than one patient, only one member of each sibling pair was included in the study of mutation frequency. With the Qinling Mountains and the Huaihe River taken as the dividing line, 557 patients were from northern China and 239 were from southern China. Both parents of the included patients were native. The patients were classified into one of three phenotype categories according to their pretreatment plasma Phe levels: mild hyperphenylalaninaemia (Phe 120-600 μmol/L), mild PKU (Phe 600-1200 μmol/L), and classic PKU (Phe >1200 μmol/L) 47 .
To improve the throughput of the assay, the PCR primers were designed not only to amplify the target DNA but also to provide a unique primer index for each of the 96 samples in each plate (multiple index PCR). This strategy resulted in 96 sets of 10-bp-long nucleic acid tags that were individually included at the 5′ ends of each of the PAH and HBB primers. The second index used for each sample was an 8-bp-long nucleic acid tag from the library adapter sequence, which identified the specific 96-well plate that contained the sample. This index was attached to the amplicons of the samples through an adapter preparation process. Using this "double index system," hundreds of samples can be mixed together and sequenced on one chip at the same time.
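The double index idea can be illustrated with a minimal Python sketch. This is not the in-house software used in the study: the read layout (plate tag immediately followed by sample tag at the start of the read), the exact-match lookup, and the example tag sequences are all assumptions made purely for illustration.

```python
from typing import Dict, Optional, Tuple

PLATE_TAG_LEN = 8    # adapter index identifying the 96-well plate
SAMPLE_TAG_LEN = 10  # primer index identifying the sample within a plate


def demultiplex(read: str,
                plate_tags: Dict[str, str],
                sample_tags: Dict[str, str]) -> Optional[Tuple[str, str, str]]:
    """Assign a read to (plate, well) from its two tags.

    Assumes a hypothetical layout: plate tag, then sample tag, then the
    amplicon sequence. Returns None if the read is too short or if either
    tag is not an exact match to a known tag.
    """
    prefix_len = PLATE_TAG_LEN + SAMPLE_TAG_LEN
    if len(read) < prefix_len:
        return None
    plate = plate_tags.get(read[:PLATE_TAG_LEN])
    well = sample_tags.get(read[PLATE_TAG_LEN:prefix_len])
    if plate is None or well is None:
        return None
    return plate, well, read[prefix_len:]


# Made-up tags and read, for illustration only.
example_plate_tags = {"ACGTACGT": "plate01"}
example_sample_tags = {"TTGACCAGTA": "A01"}
example_read = "ACGTACGT" + "TTGACCAGTA" + "GGCCTTAA"
assignment = demultiplex(example_read, example_plate_tags, example_sample_tags)
print(assignment)
```

With 96 sample tags per plate and one plate tag per plate, each (plate tag, sample tag) pair identifies a single specimen, which is what allows many plates to be pooled in one run. A production tool would normally also tolerate sequencing errors within the tags, which this sketch does not attempt.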
PCR was performed on a GeneAmp PCR system 9700 (Applied Biosystems, Foster City, CA), with 35 cycles of denaturation at 94 °C for 30 s, annealing at 56 °C for 30 s, and extension at 72 °C for 1 min. After amplification, gel electrophoresis was used to verify the quality of the amplified DNA; only DNA that passed this check was included in the library preparation. The PCR amplification products were pooled, and the indexed library adapters were added to the amplification products. Once the DNA concentration satisfied the requirements of the library preparation method, the library was sequenced on an Illumina HiSeq 2000 instrument (Illumina Inc, San Diego, CA, USA).
After sequencing, the raw sequence data were analysed using in-house software. First, all sequence data were traced back to the specimens from which they arose according to the sequences of the primer and adapter indices. Second, the amplicon sequences of each sample were aligned with the standard reference PAH sequences from the PAHvdb database (http://www.biopku.org/pah/); SNPs in the target regions were identified and the relevant information was recorded.
All of the PAH gene sequencing reactions and analyses were performed in the Centre of BGI Health clinical laboratory, Shenzhen, China.
Validation tests of Sanger sequencing. When a mutation locus was detected in a given patient, it was amplified by polymerase chain reaction (PCR) from the parents' samples and then sequenced bidirectionally on an ABI-3730 DNA analyser. This not only validated the locus but also confirmed the carrier status of the parents. The PCR cycling protocol consisted of an initial denaturation at 95 °C for 3 min, followed by 35
Pathologic analysis of novel mutations. A SIFT prediction was performed (http://sift.jcvi.org/) using the "SIFT Human SNPs" tool to obtain predictions for nonsynonymous SNPs. The annotation version was Homo sapiens GRCh37 Ensembl 63. A list of the chromosomal positions and alleles corresponding to the 41 novel mutations was uploaded to the website.
Statistical analysis. Statistical analysis was performed using the Statistical Package for the Social Sciences (SPSS, version 16.0). Mutational frequencies were calculated by the counting method. A χ² test was performed to test for differences between the two geographic populations. A p value <0.05 was considered statistically significant.
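For readers who wish to reproduce this kind of regional comparison without SPSS, the following Python sketch computes a Pearson χ² test on a 2 × 2 table of allele counts. It is a minimal illustration under stated assumptions: no continuity correction is applied (the settings used in SPSS are not given here), the per-mutation allele counts in the example are hypothetical, and only the regional allele totals (2 × 557 and 2 × 239) are derived from the cohort described above.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2 x 2 table [[a, b], [c, d]].

    No continuity correction is applied (an assumption of this sketch).
    Returns (chi2, p), where p is the upper-tail probability of a
    chi-square distribution with one degree of freedom.
    """
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    n = row1 + row2
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Allele totals follow from the cohort sizes (two alleles per patient).
NORTH_ALLELES = 2 * 557
SOUTH_ALLELES = 2 * 239

# Hypothetical allele counts for one mutation, for illustration only.
north_mutant, south_mutant = 60, 10

chi2, p = chi_square_2x2(north_mutant, NORTH_ALLELES - north_mutant,
                         south_mutant, SOUTH_ALLELES - south_mutant)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, significant = {p < 0.05}")
```

For rare mutations with small expected counts, an exact test is often preferred over the χ² approximation; the sketch does not cover that case.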