Survey and characterization of nonfunctional alleles of FUT2 in a database

The expression of ABO antigens in human saliva is regulated by the FUT2 gene, which encodes a secretor type α(1,2)fucosyltransferase. Secretors express these antigens in saliva, whereas non-secretors do not. Secretor status is of interest, especially with regard to susceptibility to various infectious diseases. Many single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) of FUT2 have been reported, and they show distinct distributions among populations. In this study, we selected 18 uncharacterized FUT2 alleles listed in the Erythrogene database and obtained genomic DNA carrying these alleles. We confirmed the haplotypes experimentally and found that 10 of the 18 alleles disagreed with those in the database, which may be attributable to their low frequency. We then examined the activity of the encoded α(1,2)fucosyltransferase for 13 alleles by flow cytometry of H antigen expression. The impact of each nonsynonymous SNP on the enzyme was also estimated in silico. We finally identified two non-secretor alleles (se 610 and se 357,856,863) and one weak secretor allele (se 262,357), whereas in silico analysis predicted that many more alleles impair enzyme function. The present results suggest that correct haplotyping and functional assays are desirable for analysis of the FUT2 gene.

The 1000 Genomes Project (https://www.internationalgenome.org/) is the largest public catalogue of human variation and genotype data 22. Erythrogene v0.8 (27 Nov 2017) (http://www.erythrogene.com/) extracted the blood group allele data from 1000 Genomes and matched them against blood group reference lists 23 (Table 2). In this study, we experimentally confirmed the nonsynonymous SNPs and their haplotypes (alleles). We also analyzed the encoded enzyme activities by transient expression assays. In addition, we predicted the effect of each nonsynonymous SNP on the protein by several in silico methods.

Results
Sequence and haplotype determination of FUT2. To survey the nonfunctional alleles of FUT2 in the database, we determined the DNA sequence of the entire coding region of the FUT2 gene in 22 individuals who were expected to carry uncharacterized alleles according to Erythrogene 23. By direct Sanger sequencing of the FUT2 coding region, we detected every SNP registered in the database for the respective DNA samples, with none missing and none extra, and encountered 17 nonsynonymous SNPs that have not yet been characterized: 58C>T, 80T>C, 205G>A, 262A>C, 370G>A, 539G>A, 542C>T, 544G>A, 566T>C, 610G>T, 616G>C, 715C>T, 856T>C, 863C>T, 964A>G, and 980C>A. In addition, the alleles carrying 357C>T, 403C>T, and 950C>T were investigated, although 403C>T and 950C>T themselves have been identified in Latin Americans (Se 357,403) and a Mongolian (se 357,950), respectively 17,19. We then determined the haplotypes by subcloning the PCR products into plasmids. Sequencing of the clones revealed that the haplotypes of ten of the 18 alleles differed from those registered in the database, even when the distance between SNPs was only 154 bp (positions 385 and 539 in HG02130; positions 216 (of se 428) and 370 in HG04017; Table 2). Among the ten alleles, three uncharacterized nonsynonymous SNPs (370G>A, 544G>A, and 863C>T) were found on the nonfunctional allele se 428, and 403C>T and 950C>T were found on another chromosome (Table 2). Accordingly, we performed functional analyses of the 13 alleles shown in Table 3. Haplotyping revealed that 863C>T was present on two different alleles: 357C>T; 856T>C; 863C>T (HG03973, Indian Telugu in the UK) and 428G>A (plus 4 additional SNPs); 863C>T (NA18908, Yoruba in Ibadan) (Table 2).
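The clone-based phasing described above can be illustrated with a minimal sketch. It assumes that each sequenced clone derives from a single chromosome, so the set of variants seen together in one clone defines one haplotype. The function name `phase_from_clones`, the `min_support` threshold, and the clone calls are hypothetical and are not data from this study; the threshold is one possible way to discard chimeric clones arising from PCR-mediated recombination.

```python
from collections import Counter

def phase_from_clones(clone_calls, min_support=2):
    """Count haplotypes observed across sequenced clones.

    clone_calls: iterable of sets, one per clone, each holding the
                 non-reference variants seen in that clone.
    min_support: minimum number of clones required to accept a haplotype
                 (hypothetical guard against chimeric PCR products).
    Returns a dict mapping haplotype (sorted tuple of variants) to clone count.
    """
    counts = Counter(tuple(sorted(calls)) for calls in clone_calls)
    return {hap: n for hap, n in counts.items() if n >= min_support}

# Toy clone calls (hypothetical, for illustration only).
clones = [
    {"357C>T", "856T>C", "863C>T"},
    set(),                              # clone carrying no variant
    {"357C>T", "856T>C", "863C>T"},
    set(),
    {"357C>T"},                         # single odd clone, treated as an artifact
    {"357C>T", "856T>C", "863C>T"},
    set(),
]

haplotypes = phase_from_clones(clones)
for hap, n in haplotypes.items():
    print(hap if hap else "(no variant)", n)
```

In this toy case two haplotypes are supported by three clones each, and the single discordant clone is dropped.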

Functional analyses of candidates of non-secretor alleles. To determine whether each uncharacterized FUT2 allele encodes a functional Se enzyme, we ligated each allele into the mammalian expression vector pcDNA3.1 and transfected it into COS-7 cells. Cell surface H antigen was examined by flow cytometry using a monoclonal antibody to H type 1-4 (1E3) 24. We performed five independent transient expression experiments, and a representative result for each allele is shown in Fig. 1. The expression levels of H antigens on the cell surface are shown in Table 3. The percentage of H antigen-positive cells transfected with pcDNA3.1   Table 3).
Determination of causal SNP of FUT2-357C>T; 856T>C; 863C>T for inactivation of the encoded enzyme. As mentioned above, the expression study suggested that FUT2 of 357C>T; 856T>C; 863C>T was a nonfunctional allele. However, this allele contains two nonsynonymous SNPs, 856T>C and 863C>T, and it was unclear which SNP was responsible for inactivation of the Se enzyme. We generated FUT2 of 357C>T; 856T>C and FUT2 of 357C>T; 863C>T by in vitro mutagenesis and performed transient expression and flow cytometry. As shown in Table 3, 856T>C completely abolished the activity, whereas 863C>T impaired it only partially.
Estimation of significance of uncharacterized nonsynonymous SNPs. In addition to the transient expression study, we predicted the possible impacts of 14 amino acid substitutions on the functions of encoded proteins using four software programs.
The predictions were not always consistent with the results of the expression experiments (Table 3). When FUT2 of 610G>T, of 357C>T; 856T>C; 863C>T, and of 262A>C; 357C>T are considered nonfunctional alleles and the others functional, the mismatch rate was 42.9%, 64.3%, 57.1%, and 64.3% for MutationTaster, MutationAssessor, PolyPhen-2, and SIFT, respectively. Of the 14 SNPs, the predicted effects agreed across all software programs and the experiments for M38T, R191Q, G215W, and Y297H, whereas every program disagreed with the experiments for I333V, R250W, V200A, A80T, and P192L (Table 3). The software generally tended to overestimate the impacts of the nonsynonymous SNPs tested here.
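The mismatch rates quoted above can be reproduced with a short calculation. The sketch below assumes that each rate is the number of SNPs for which the prediction and the expression result disagree, divided by the 14 SNPs tested; the per-tool mismatch counts (6, 9, 8, 9) are not stated explicitly and are inferred here from the reported percentages. The function name `mismatch_rate` is hypothetical.

```python
def mismatch_rate(predicted_damaging, observed_nonfunctional):
    """Percentage of SNPs where the in silico call disagrees with the assay.

    Both arguments are equal-length sequences of booleans, one entry per SNP.
    """
    if len(predicted_damaging) != len(observed_nonfunctional):
        raise ValueError("both sequences must cover the same SNPs")
    mismatches = sum(p != o for p, o in zip(predicted_damaging, observed_nonfunctional))
    return 100.0 * mismatches / len(predicted_damaging)

N_SNPS = 14  # amino acid substitutions evaluated in silico

# Mismatch counts inferred from the reported percentages (assumption).
inferred_mismatches = {
    "MutationTaster": 6,
    "MutationAssessor": 9,
    "PolyPhen-2": 8,
    "SIFT": 9,
}

rates = {tool: round(100.0 * k / N_SNPS, 1) for tool, k in inferred_mismatches.items()}
print(rates)
```

With real per-SNP calls, `mismatch_rate` would be applied once per program to the column of predictions and the column of experimental outcomes from Table 3.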

Discussion
Public databases of genetic variation are now available, as are websites that repurpose these data for specific applications. The 1000 Genomes Project Phase 3 provides a list of variants and haplotypes for 2,504 individuals from 26 populations, obtained by a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping 23. One study reported that phasing and imputation of rare variants are unreliable, which likely reflects the limited sample size of the project data 25. In this study, we observed discordance between the haplotypes listed in Erythrogene, which are derived from the 1000 Genomes Project, and the experimentally phased haplotypes for ten of 18 alleles. The FUT2 alleles we investigated have frequencies of only 0.02 to 0.06% among the 2,504 samples from 26 populations in 1000 Genomes, that is, one to three of 5,008 alleles. This rarity may explain the discordance, but such discordance should not be ignored in association studies on secretor status when an allele reaches an appreciable prevalence in the target population. Previously, we identified 400G>A (rs370886251) of a functional FUT2 allele, Se 400, in a Samoan population at a frequency of 2.1% (one of 48 alleles) 26. This allele was also observed in Indonesians at 1.2% (two of 166 alleles) 27. In addition, we encountered it in six New Guinean populations at relatively high prevalences of 10.6% (7 of 66 alleles) to 36.8% (42 of 114 alleles) 28. These results suggest that 400G>A is distributed in Oceania and neighboring countries. However, 400G>A is not listed in 1000 Genomes and consequently not in Erythrogene. These observations raise the possibility that some of the nonsynonymous SNPs examined here may be present at non-negligible frequencies in certain populations 25.
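The allele frequencies in this paragraph follow directly from the allele counts, with each diploid individual contributing two alleles. The sketch below is a plain arithmetic check using only the counts quoted in the text; the helper name `allele_frequency_percent` is hypothetical.

```python
def allele_frequency_percent(allele_count, total_alleles):
    """Allele frequency in percent, given observed copies and total alleles."""
    if total_alleles <= 0 or not 0 <= allele_count <= total_alleles:
        raise ValueError("counts must satisfy 0 <= allele_count <= total_alleles")
    return 100.0 * allele_count / total_alleles

# 2,504 diploid individuals carry 2 * 2,504 = 5,008 alleles.
total_1000g = 2 * 2504

rare_low = round(allele_frequency_percent(1, total_1000g), 2)   # one copy
rare_high = round(allele_frequency_percent(3, total_1000g), 2)  # three copies

# 400G>A counts quoted in the text.
samoan = round(allele_frequency_percent(1, 48), 1)
indonesian = round(allele_frequency_percent(2, 166), 1)
new_guinea_low = round(allele_frequency_percent(7, 66), 1)
new_guinea_high = round(allele_frequency_percent(42, 114), 1)

print(total_1000g, rare_low, rare_high)
print(samoan, indonesian, new_guinea_low, new_guinea_high)
```

The check makes the scale of the problem concrete: an allele seen only one to three times among 5,008 chromosomes offers very little information for statistical phasing.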
The best way to evaluate the activity of the Se enzyme may be to consider it in conjunction with phenotypic data obtained from red blood cells or saliva. In the Lewis blood group, Le(a − b +) is identified as a secretor, Le(a + b −) as a non-secretor, and Le(a + b +) as a weak secretor 29. The amount of ABO substances in saliva can also be measured. However, we could not perform phenotyping and can only speculate on secretor status from the genetic information. In this study, we estimated the activity of the enzymes encoded by the FUT2 alleles of interest by transient expression followed by flow cytometric detection of cell surface H antigens. We also predicted the impacts of the target missense SNPs on the Se enzyme using four software programs, because we had previously encountered discrepancies between in vitro and in silico results 17-19. In the present results, we again observed discrepancies for six to nine of 14 SNPs, even though all of the SNPs reside in the catalytic domain. Previously, we identified a FUT2 allele with 357C>T; 685G>A in a Bangladeshi secretor individual with the genotype 357C>T; 685G>A/302C>T. The α(1,2)fucosyltransferase activity of the enzyme encoded by FUT2 of 357C>T; 685G>A was 8% of that of the wild-type allele 26. On the basis of this observation, we categorized FUT2 of 566T>C as a secretor allele (Se 566) and that of 262A>C; 357C>T as a weak secretor allele (se 262,357, Se w), although the secretor phenotypes were unknown and the 566T>C substitution seemed to partially impair enzyme activity (Table 3).
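The Lewis-based classification mentioned above amounts to a small lookup table. The sketch below encodes only the three phenotype assignments stated in the text; the function name `secretor_status_from_lewis` and the "undetermined" label for Le(a − b −) are assumptions added for illustration, reflecting that secretor status cannot be read from the Lewis phenotype in Lewis-negative individuals.

```python
# Keys are (Le^a detected, Le^b detected) on red blood cells.
LEWIS_TO_SECRETOR = {
    (False, True): "secretor",       # Le(a- b+)
    (True, False): "non-secretor",   # Le(a+ b-)
    (True, True): "weak secretor",   # Le(a+ b+)
}

def secretor_status_from_lewis(le_a, le_b):
    """Infer secretor status from the Lewis phenotype.

    Returns "undetermined" for Le(a- b-), which the Lewis phenotype
    alone cannot resolve (label chosen here for illustration).
    """
    return LEWIS_TO_SECRETOR.get((bool(le_a), bool(le_b)), "undetermined")

print(secretor_status_from_lewis(False, True))
print(secretor_status_from_lewis(True, True))
print(secretor_status_from_lewis(False, False))
```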
In this study, we used a monoclonal antibody isolated in an academic laboratory, but H antigens can also be evaluated with the commercially available UEA-1 lectin (Ulex europaeus agglutinin I) 19,21,24. In addition, as mentioned earlier, the coding sequence of FUT2 lies entirely within exon 2. This makes it easy to clone the coding sequence or to determine the haplotype of the coding polymorphisms.
Considering all the factors mentioned above, experimental procedures such as haplotyping by cloning and evaluation of Se enzyme activity are desirable for FUT2. Nevertheless, the 1000 Genomes Project provides a valuable catalogue of human genetic variation, and Erythrogene, with its focus on blood group systems, enables easy reference to the FUT2 alleles in the 1000 Genomes Project. Indeed, the presence of every listed SNP was confirmed in the indicated samples. In addition, software programs that predict the functional impact of amino acid substitutions are valuable tools, especially for proteins whose functions are hard to assess experimentally. Accordingly, there is no doubt about the usefulness of these databases and software programs.

Conclusion
We identified two non-secretor alleles (se 610 and se 357,856,863) and one weak secretor allele (se 262,357) in samples from the 1000 Genomes Project. Experimental phasing and expression studies are desirable for analysis of FUT2.

PCR amplification of coding region of FUT2 and sequence analysis. The coding regions of FUT2 were amplified and directly sequenced as described previously 17.
Haplotype determination of FUT2. To determine the haplotypes of individuals who were heterozygous at two or more positions, we cloned PCR products into a plasmid and sequenced the clones as described previously 17. After 2 days of culture, the cells were incubated with the 1E3 antibody to H type 1-4, followed by incubation with FITC-conjugated goat anti-mouse IgM secondary antibody (Bethyl Laboratories, Montgomery, TX), and expression of H antigen on the cell surface was monitored by flow cytometry (BD Accuri C6, Becton Dickinson, Franklin Lakes, NJ) as described previously 18,24. The experiments were performed five times. About 1 × 10^5 cells were lysed, firefly luciferase activity was assayed using the ONE-Glo Luciferase Assay System (Promega, Madison, WI), and similar transfection efficiency across experiments was confirmed by the luminescence intensity (Table 3).