Introduction

The human histo-blood group ABO system is crucial in safe blood transfusion and cell/tissue/organ transplantation1,2. This system consists of A and B oligosaccharide antigens expressed on red blood cells (RBCs) as glycoproteins and glycolipids, and antibodies against those antigens in serum. A and B antigens are also expressed by epithelial and endothelial cells, and in secretor-type individuals they are also expressed on mucins secreted by exocrine glands. The immuno-dominant structures of A and B antigens are GalNAcα1->3(Fucα1->2)Gal- and Galα1->3(Fucα1->2)Gal-, respectively. A and B alleles of the ABO genetic locus encode A and B transferases, which respectively transfer an N-acetyl-D-galactosamine (GalNAc) or a D-galactose (Gal) to H substances with an α1,3-glycosidic linkage. H substances with the Fucα1->2Gal- structure are synthesized by fucosylation catalyzed by α1,2-fucosyltransferases (α1,2-FTs) encoded by the FUT1/FUT2/SEC1 genes. FUT1-encoded α1,2-FTs and FUT2/SEC1-encoded α1,2-FTs exhibit distinct acceptor substrate specificities and are differentially expressed among tissues. In humans SEC1 is a pseudogene and the FUT2 gene has frequent null alleles, so that about 20% of individuals are incapable of expressing H, A, or B antigens in secretions (non-secretor type). In the absence of α1,2-FTs no H antigens are produced. Therefore, A/B transferases function only when at least one active α1,2-FT is simultaneously present.

In 1990 we correlated the nucleotide sequences of A, B and O allelic cDNAs with the expression of A and B antigens and elucidated the molecular genetic basis of the human histo-blood group ABO system3,4. Four amino acid substitutions (Arg176Gly, Gly235Ser, Leu266Met and Gly268Ala) distinguish A and B transferases. A single nucleotide deletion (261delG) was found in O alleles. We later identified mutations in A/B subgroup alleles (A2, A3, Ax and B3) and mutations in cis-AB and B(A) alleles specifying dual expression of A and B antigens5,6,7. Another type of O allele, which lacks 261delG but contains a Gly268Arg substitution, was found afterward8. More than 250 ABO alleles are registered in the Blood Group Antigen Gene Mutation Database, and ABO has become one of the most studied human genetic loci for its polymorphism9.

ABO genes exist not only in humans but also in many other vertebrate species, although ABH antigen expression patterns may differ. In addition to A and B transferases, other enzymes transfer a GalNAc/galactose by α1,3-glycosidic linkage: α1,3-galactosyltransferase and isogloboside 3 synthase (both of galactose specificity) and Forssman glycolipid synthase (GalNAc specificity). These enzymes catalyze the last synthetic steps of α1,3-galactosyl epitope (Galα1->3Galβ1->4GlcNAcβ-), isogloboside 3 (Galα1->3Galβ1->4Glcβ1-Ceramide) and Forssman glycolipid antigen (GalNAcα1->3GalNAcβ1->3Galα1->4Galβ1->4Glcβ1-Ceramide), respectively. It should be noted that these enzymes utilize acceptor substrates other than H substances, as the chemical structures of their reaction products indicate. Genes encoding these α1,3-Gal(NAc) transferases (α1,3-Gal(NAc)Ts) (GGTA1, A3GALT2 and GBGT1 genes, respectively) are paralogous to the ABO gene and they are evolutionarily related10,11,12,13. Another paralogous genetic locus, GLT6D1 (glycosyltransferase 6 domain containing 1), has been associated with periodontitis susceptibility14, although transferase activity remains to be demonstrated for its encoded protein. Based on the nucleotide and deduced amino acid sequences of ABO and related genes, a birth-and-death evolution model was proposed15,16. Several theories have been proposed on the evolution of the primate ABO polymorphism17,18,19,20,21,22, and the dynamics of human ABO gene evolution have been extensively studied23,24. A brief summary of prior knowledge about ABO evolution is presented in each sub-section of the Results section. Indisputably, sequences, single nucleotide polymorphisms (SNPs) and mutations are critical for investigating gene evolution. However, analyses based solely on sequences are insufficient, especially because of genetic recombination. To interpret gene evolution properly, knowledge of the gene-encoded proteins is fundamental. What the protein's function is, which portion(s) of the protein are important for that function, where the protein is located, whether it forms multimers and how it interacts with other molecules all provide valuable information. In particular, an understanding of the sugar specificity of A and B transferases is essential for investigating ABO gene evolution. As in many other areas of genetic studies, functional assays are of critical importance.

In the present work, we analyzed many homologous genes and sequences that had been identified in various species through genome sequencing efforts. In addition to the sequences, we also utilized other available data and information: gene structure to determine whether a gene is partial or complete; chromosomal organization to deduce duplication(s), deletion(s), inversion(s) and translocation(s) that have occurred; and information on A/B transferases and A/B oligosaccharides to obtain clues on functionality. Data were interpreted with caution because of the incompleteness of genome sequence databases, wrong annotations, differences among individuals within a species and errors in genome assemblies. Based on mostly relevant, but not entirely accurate, data, we have delineated a potential scenario of ABO gene evolution. Taking advantage of our expertise, we also prepared several dozen amino acid substitution constructs of the human A transferase in an expression vector by in vitro mutagenesis, determined their GalNAc/galactose specificity and generated a code table correlating amino acid sequence motif with A/B specificity. Utilizing this table, we decoded the A/B specificity of the ABO genes annotated from a variety of species, which in turn has allowed us to evaluate several critical hypotheses on the evolution of the ABO and related genes and their functional impact.

Results

Gene duplications and changes in substrate specificity of the encoded glycosyltransferases created ABO family of genes in animals

All the α1,3-Gal(NAc)T genes in genome databases that were analyzed are listed in Fig. 1. Species were aligned based on their evolutionary relationship (human at top and lamprey at bottom)25. A phylogenetic tree was constructed for the 104 protein sequences that are likely to encode functional α1,3-Gal(NAc)Ts and is shown in Fig. 2. GBGT1, A3GALT2, GGTA1 and GLT6D1 genes formed separate clusters, whereas both A and B genes were clustered into a single ABO gene cluster. Except that many nonfunctional genes are omitted, these results obtained from amino acid sequence analysis coincided well with the nucleotide sequence-based Ensembl gene tree ENSGT00400000022032 and a previous report15.

Figure 1

Species-dependent distribution of FUT1/FUT2/SEC1 α1,2-fucosyltransferase genes and ABO/GBGT1/A3GALT2/GGTA1/GLT6D1 α1,3-Gal(NAc) transferase genes.

This table shows the distribution of α1,2-FT genes and α1,3-Gal(NAc)T genes in a variety of organisms. Ensembl gene identifiers are listed with only the meaningful digits, omitting the leading zeros of their IDs. Genes were categorized into groups based on Ensembl gene trees, chromosomal locations and our own analyses; they are aligned in different columns and highlighted in different colors. Amino acid sequences corresponding to the codons 266–268 of human A/B transferases are also shown. The symbol “---” indicates the absence of sequence motif and “N/A” means not annotated in databases. A single column of “Pseudo/Ancient” was used to list two types of annotated gene sequences: the ABO retropseudogene sequences that were originally derived from an intronless cDNA are highlighted in tan (Pseudo) and the sequences that formed a cluster next to the ABO gene in the phylogenetic analysis are highlighted in yellow (Ancient). The gene sequences that formed a cluster outside of the ABO/GBGT1 genes are highlighted in orange and are shown separately in the “ABO/GBGT1 Ancient” column. The annotated genes may or may not be functional; nonfunctional ones may also be called O genes or pseudogenes. Note that genome sequences were not complete for many species and therefore errors may exist. In addition, there are numerous homologous sequences that have yet to be annotated and mapped on chromosomes. Furthermore, polymorphism may also exist.

Figure 2

Evolution of α1,3-Gal(NAc) transferase family of genes.

The MEGA5 software was used to analyze 104 amino acid sequences potentially encoding intact ABO proteins. The amino acid sequences corresponding to codons 69–354 of the human A transferase were examined. 1,000 bootstrap replications were computed. Branches leading to ABO, GBGT1, A3GALT2, GGTA1 and GLT6D1 genes are colored in yellow, grey, green, purple and blue, respectively. The bootstrap frequencies are shown on the branching points. Fishes, amphibians, reptiles and birds are marked with closed circles in red, purple, green and dark blue whereas mammals are unmarked. The species code names correspond to the names shown in the “Ensembl Database” column in Fig. 1. For instance, PTR for chimpanzee (Pan troglodytes) is obtainable by removing ENS and G from the database name (ENSPTRG).
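The naming conventions used in Figs. 1 and 2 can be reproduced mechanically. The following Python snippet is a minimal sketch, assuming that gene identifiers follow the pattern "ENS" + species letters + "G" + zero-padded digits; the function names are hypothetical and are not part of this study.

```python
import re

# Assumed identifier layout: "ENS" + optional species letters + "G" + digits.
# The helper names below are illustrative only.
_GENE_ID = re.compile(r"^(ENS[A-Z]*G)(\d*)$")


def species_code(gene_id: str) -> str:
    """Return the species code by removing 'ENS' and 'G' from the ID prefix."""
    match = _GENE_ID.match(gene_id)
    if match is None:
        raise ValueError(f"not an Ensembl-style gene identifier: {gene_id!r}")
    prefix = match.group(1)
    return prefix[3:-1]  # drop leading 'ENS' and trailing 'G'


def meaningful_digits(gene_id: str) -> str:
    """Return the numeric part of the ID without its leading zeros."""
    match = _GENE_ID.match(gene_id)
    if match is None:
        raise ValueError(f"not an Ensembl-style gene identifier: {gene_id!r}")
    return match.group(2).lstrip("0")


if __name__ == "__main__":
    print(species_code("ENSPTRG"))                  # chimpanzee prefix
    print(species_code("ENSMLUG00000029891"))       # microbat gene ID
    print(meaningful_digits("ENSMLUG00000029891"))
```

With these assumptions, the chimpanzee prefix ENSPTRG yields PTR, as described in the legend.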

The genes neighboring those glycosyltransferase genes are well conserved in many species and the consensus organizations are shown in Table 1. There is wide variation in the repertoire of those genes among different species, and the model of birth-and-death evolution26 fits well with the α1,3-Gal(NAc)T family of genes, as previously reported15. For instance, the frog Xenopus tropicalis (an amphibian) has ABO genes but lacks any other α1,3-Gal(NAc)T genes, whereas all the bird species examined have GBGT1 but lack A3GALT2, GGTA1 and GLT6D1 genes.

Table 1 Consensus organization of genes surrounding α1,3-Gal(NAc) transferase and α1,2-fucosyltransferase genes

Emergence of α1,2-fucosyltransferase genes preceded A/B transferase gene appearance in amphibians

Phylogenetic analyses and chromosomal locations were used to separate FUT1, FUT2 and SEC1 genes, which are shown in 3 different columns in Fig. 1. The distributions of these genes suggest that the FUT2 gene was the oldest α1,2-FT gene. The FUT1 gene later appeared from the FUT2 lineage after gene duplication followed by acquisition of novel expressional/enzymatic characteristics. The SEC1 gene emerged much later, after duplication of the FUT2 gene and subsequent divergence from it, confirming the previously proposed evolutionary theory of the α1,2-FT family of genes27. The chromosomal region containing α1,2-FT genes has remained stable in many species and the consensus is shown in Table 1.

A/B antigen expression was previously reported in frog species28,29. As shown in Fig. 1, neither FUT1/FUT2/SEC1 genes nor ABO genes are present in fish genomes. In contrast, the frog Xenopus tropicalis has 4 FUT2 gene sequences, several of which seem to encode active α1,2-FTs. This frog species also contains multiple ABO gene sequences, including a few with possible functionality. Chinese softshell turtle and many mammalian genomes also possess potentially functional α1,2-FT and A/B transferase genes. Therefore, it is logical to hypothesize that A/B antigen(s) appeared after the separation of fish and amphibian lineages.

A code table was generated to correlate amino acid sequence motif with A/B specificity

Progress has been made in understanding A/B transferases over the last decade. Among the 4 amino acid substitutions at codons 176, 235, 266 and 268 between the human A and B transferases, the third and fourth substitutions were shown to be crucial for the different donor nucleotide-sugar substrate specificity, whereas the second is influential and the first is of minor importance4. Our in vitro mutagenesis study30 and the determination of the three-dimensional structures of A/B transferases by others31 confirmed the critical roles of amino acids at codons 266 and 268.

In this study we prepared a library of 40 amino acid substitution constructs of human A transferase, which contained any one of the 20 possible amino acid residues at codon 266 in combination with either glycine of A transferase or alanine of B transferase at codon 268. Furthermore, we also prepared additional constructs at codons 263–268 that contained deduced amino acids present in annotated ABO and related α1,3-Gal(NAc)T genes in the genome databases but were not represented in the library. DNA from those constructs was transfected into HeLa cells expressing cell-surface H substances and the expression of A/B antigens was examined immunologically, using antibodies against blood group A and B antigens. A code table was generated that correlates amino acid sequence motifs and A/B specificity of the enzymes encoded by the various constructs (Table 2). The activity is shown semi-quantitatively on a 4-fold exponential scale, with 5+ the highest and - none. The motifs observed in ABO genes in natura are shown in bold type.
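The composition of the 40-construct library described above follows from a simple combination of positions. The Python sketch below enumerates the corresponding three-residue motifs at codons 266–268; it assumes that codon 267 remains glycine, as in the human A transferase backbone, and the function name is hypothetical.

```python
from itertools import product

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def library_motifs() -> list[str]:
    """Enumerate codon 266-268 motifs of the substitution library.

    Codon 266: any of the 20 residues.
    Codon 267: glycine, unchanged from the human A transferase (assumption).
    Codon 268: glycine (A transferase) or alanine (B transferase).
    """
    return [f"{aa266}G{aa268}" for aa266, aa268 in product(AMINO_ACIDS, "GA")]


if __name__ == "__main__":
    motifs = library_motifs()
    print(len(motifs))
    print(motifs[:4])
```

The list contains the human A (LGG) and B (MGA) motifs among its 40 entries.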

Table 2 Specificity and activity of human A transferase expression constructs containing various amino acids at codons 263–268

The control constructs exhibited the anticipated specificity: AGG motif at codons 266–268 in pig A gene, LGG and MGA in human A and B alleles and GGA in mouse cis-AB gene for A, A, B and AB specificity, respectively. The results clearly demonstrated that the amino acid residue at codon 266 is crucial in determining the sugar specificity and activity of the encoded transferase. Some constructs possessing glycine at codon 268 exhibited different specificity/activity from those possessing alanine, suggesting that codon 268 is also important. The Gly268Ala substitution tended to favor the use of galactose over GalNAc, possibly because the larger side chain at that position hinders access of the larger GalNAc while facilitating access of the smaller galactose. Several constructs with amino acid sequence motifs that overlapped with those of our previous study30 exhibited the same specificity/activity in spite of the differences in the A/B transferase backbone.

In addition to the constructs expressing either A or B transferase activity, several constructs exhibited both A and B transferase activities whereas several others showed none. For instance, human A transferase constructs containing AAA, CGG, or SGG motif exhibited A specificity, whereas those with IGA, MAA, MGS, or QGC exhibited B specificity. The constructs with MGG, SGA, TGA, or AAS showed both A and B specificity whereas those with AAN, TEA, or TGF showed neither. An unexpected finding was that glycine at codon 267 is not an absolute pre-requisite for A/B transferase activity. We next applied the codes to uniquely assign potential A/B specificity of the annotated ABO genes and critically evaluated several hypotheses on the evolution of the ABO genes.
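The way such a code table is applied can be illustrated with a simple lookup. The Python sketch below includes only the motifs named in the text above (not the full Table 2), ignores the semi-quantitative activity scale, and uses hypothetical names; a returned specificity is meaningful only if the gene otherwise encodes a functional enzyme.

```python
# Partial code table: codon 266-268 motif -> deduced specificity.
# Only motifs mentioned in the text are listed; Table 2 is more extensive.
CODE_TABLE = {
    # GalNAc (A) specificity
    "AGG": "A", "LGG": "A", "AAA": "A", "CGG": "A", "SGG": "A",
    # galactose (B) specificity
    "MGA": "B", "IGA": "B", "MAA": "B", "MGS": "B", "QGC": "B",
    # dual (AB) specificity
    "GGA": "AB", "MGG": "AB", "SGA": "AB", "TGA": "AB", "AAS": "AB",
    # no detectable activity
    "AAN": "none", "TEA": "none", "TGF": "none",
}


def decode_motif(motif: str) -> str:
    """Return 'A', 'B', 'AB', 'none', or 'unknown' for a three-residue motif."""
    return CODE_TABLE.get(motif.upper(), "unknown")


if __name__ == "__main__":
    for motif in ("LGG", "MGA", "GGA", "TEA", "---"):
        print(motif, decode_motif(motif))
```

Motifs absent from the partial table, including the "---" placeholder for a missing motif, are reported as unknown rather than guessed.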

A and B gene sequences appeared early in evolution and are potentially present in a non-allelic manner in some species

The first evidence of genomes with multiple copies of ABO gene sequences came from Southern hybridization experiments showing multiple bands of hybridization in dog, rabbit and rat genomic DNA using a human probe32. Later studies demonstrated multiple genes in rat33. As shown in Fig. 1, additional species also seem to possess multiple ABO gene sequences: Xenopus tropicalis frog, Chinese softshell turtle, platypus, microbat, dog, ferret, panda, horse, kangaroo rat, rat and rabbit. Genes flanking full/partial ABO genes are shown for each individual species in Table 3, together with the amino acid sequences corresponding to codons 266–268 of the human A/B transferases.

Table 3 Genes adjacent to ABO genes

We applied Table 2 to decode the A/B specificity of individual ABO gene sequences annotated in various vertebrate species. Several species not only contain multiple copies of ABO gene sequences but may also have both A-specific and B-specific gene sequences in their genomes. For instance, Xenopus tropicalis frog has A gene sequences with AGG or TGC motif and B gene sequences with MAA motif. Other species identified are: Chinese softshell turtle (AAA for A and MGA for B), platypus (AGG for A, MGA for B and LGA for AB), horse and rat (AGG for A and MGA for B), microbat (LGG for A and MGA for B) and rabbit (LGG for A and MGA and IGA for B). These results suggest that functional differentiation between A and B gene sequences appeared early in evolution, possibly just after the ABO gene emergence in amphibians.
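The decoding step described above amounts to classifying each annotated motif and then asking whether a genome carries both classes. The Python sketch below uses only the species and motifs listed in this paragraph; the data structure and function name are hypothetical, and the result says nothing about whether the sequences are functional or allelic.

```python
# Motif -> deduced specificity, restricted to motifs named in this paragraph.
SPECIFICITY = {
    "AGG": "A", "TGC": "A", "AAA": "A", "LGG": "A",
    "MAA": "B", "MGA": "B", "IGA": "B",
    "LGA": "AB",
}

# Codon 266-268 motifs reported for each species in the text.
SPECIES_MOTIFS = {
    "Xenopus tropicalis": ["AGG", "TGC", "MAA"],
    "Chinese softshell turtle": ["AAA", "MGA"],
    "platypus": ["AGG", "MGA", "LGA"],
    "horse": ["AGG", "MGA"],
    "rat": ["AGG", "MGA"],
    "microbat": ["LGG", "MGA"],
    "rabbit": ["LGG", "MGA", "IGA"],
}


def has_a_and_b(motifs: list[str]) -> bool:
    """True if the motifs include at least one A-specific and one B-specific motif."""
    classes = {SPECIFICITY.get(m, "unknown") for m in motifs}
    return "A" in classes and "B" in classes


if __name__ == "__main__":
    for species, motifs in SPECIES_MOTIFS.items():
        print(species, has_a_and_b(motifs))
```

A genome carrying only A-type motifs, such as one with AGG alone, would return False.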

As shown in Table 3, horse A and B gene sequences are closely located in tandem on the same chromosome. Therefore, if the horse genome assembly is correct, those sequences may not be unigenic alleles. Microbat A and B gene sequences have not yet been mapped on chromosomes. However, at least one A and one B gene sequence of the three present in the genome were aligned side-by-side within a single contig (ENSMLUG00000029891 with LGG and ENSMLUG00000026173 with MGA in Scaffold GL431842: 18,186-26,341). Accordingly, they are not allelic, either. The rat genome in the Ensembl database lists 4 ABO gene sequences: 1 A (AGG) and 3 Bs (MGA). The surrounding chromosomal organization in Table 3 shows that those sequences are not alleles. Rat A and B gene sequences located tandemly in a cis-manner contrast with the mouse gene (GGA), which encodes a transferase with dual specificity (cis-AB enzyme)34.

However, heterogeneity seems to exist among different strains of rats. The Ensembl genome is from the BN/SsNHsdMCW strain. In addition to this strain, the GenBank database also houses the genome sequence from another strain, the BN/Sprague-Dawley strain (Rn_Celera). One A (AGG) and 2 B (MGA) gene sequences, rather than 1 A and 3 B, were mapped for this strain. In another strain, Wistar, 3 A and 1 B gene sequences were cloned although they have not been mapped33. Different cloning results were obtained from the inbred GOT-W strain35 and the BDIX strain36, further complicating the understanding of rat ABO genes.

In spite of potential errors and problems that are frequently associated with the sequences and genome assemblies of polymorphic genes and multi-gene families, the presence of multiple copies of non-allelic A and B gene sequences in rat and other species cannot all be attributed to bioinformatics mistakes. Even if the assemblies of all those species suffered from the same caveats, the case still stands at least for rats. Because three different A and one B gene sequences were cloned from a single Wistar rat, they cannot be allelic at a single genetic locus33,37.

Many non-allelic ABO protein sequences clustered within species in phylogenetic analyses

Phylogenetic trees of ABO proteins/peptides were constructed from species having more than 1 annotated ABO gene (Fig. 3a). For comparison, the human A and B transferase sequences were included in the analysis although human sequences are allelic. Proteins corresponding to full genes with initiation and termination codons are marked with circles, whereas peptides corresponding to partial genes are marked with triangles. The symbols' colors indicate deduced potential A/B specificity (GalNAc, galactose, both, none and uncharacterized specificity are shown in red, green, yellow, blue and black, respectively). The amino acids corresponding to codons 266–268 of the human A transferase are shown in parentheses.

Figure 3

(a): A phylogenetic tree of ABO proteins/peptides from species possessing multiple copies of the ABO gene. Phylogenetic analyses were performed with protein/peptide sequences from species that contain more than one ABO gene in their genomes. Processed intronless retropseudogenes were excluded from analysis. The amino acid sequences were analyzed in their entirety. Potentially functional proteins from full genes with the initiation and termination codons and peptides from partial genes without them are marked with circles and triangles, respectively. The symbol's color indicates potential sugar specificity (GalNAc, galactose, GalNAc/galactose, none and unknown for red, green, yellow, blue and black, respectively). Amino acid sequences corresponding to the codons 266–268 of human A/B transferases are also shown in parentheses. Genes in the same species are bracketed. When potential A and B gene sequences are both present in a single species, the bracket is colored in red. Horse genes and ferret genes in 2 separate clusters are bracketed in blue and purple, respectively. Other species are bracketed in dark blue. (b): A phylogenetic tree of originally intronless ABO retropseudogene products. The entire protein sequences of processed retropseudogenes were analyzed. Branches leading to different amino acid sequences at the important positions are coded in different colors. (c): ABO gene evolution in bacteria. The EMBL-EBI InterPro database listed 57 bacterial proteins within the GT6 family. 56 proteins/peptides, excluding 1 short one, were aligned to construct a phylogenetic tree. The A gene from Helicobacter mustelae and the B gene from Escherichia coli O86 strain were included in the study and their results are shown in bold type. The B gene-encoded protein (E1I6K1) consists of 234 amino acids and the bacterial protein sequences corresponding to codons 2–219 of this protein were analyzed. The amino acid sequence motifs corresponding to the codons 266–268 of human A/B transferases are also shown in parentheses. In E1I6K1 these correspond to codons 145–147. The symbol's color indicates the sugar specificity of the transferases: red, green and yellow for GalNAc, galactose and both, respectively, assuming that they are functional.

The majority of ABO protein sequences were clustered in species-specific groups, including platypus, microbat, rabbit and rat. However, several protein sequences from two distant species are on a common phylogenetic branch. Among them, two frog (both with MAA motif) and two turtle (with AAN and AAS motifs) sequences clustered together. However, those sequences were deduced to be nonfunctional, having aberrant gene organizations such as the absence of N-terminal exons or missing initiation/termination codons. Two ferret (IGA or MEA) and three panda (MGP, MGA and ---) protein sequences, corresponding to partial genes with aberrations in codon reading frame and gene structure, clustered on a common branch, apart from the ferret protein from a full gene with AGG motif. In horse, two genes (MGA and AGG) that are located side-by-side on the same chromosome were separated in the phylogenetic tree, possibly due to frameshift mutations deleting a serine close to the MGA motif (MGAFFGGSV) and the accelerated accumulation of mutations after inactivation.

An intronless ABO gene cDNA was integrated into the mammalian genome

In addition to full/partial genes, ABO retropseudogenes also exist, originally derived from an intronless ABO gene cDNA that was integrated into the genome during mammalian evolution (Fig. 1). Those retropseudogenes clustered separately from full/partial ABO genes in phylogenetic analyses and a phylogenetic tree of ABO retropseudogene products is shown in Fig. 3b. This tree suggests that the original sequence may have contained a TGA motif, which is present in some bacterial ABO genes (see below) but is missing in the animal ABO genes that were analyzed, other than the retropseudogenes. The implication and potential significance are unknown.

Several different molecular mechanisms may be responsible for animal AO polymorphism

Generation of enzymes with novel specificity and/or creation of genes with differential expression patterns must satisfy special conditions and requirements. In contrast, inactivation of gene function or loss of transferase activity may be achieved relatively easily. Diverse inactivation mechanisms, including frameshift and missense mutations, have been identified in human O alleles4,8,16,23,38,39. Additionally, species-specific O alleles, which possibly resulted from independent silencing mutations, are known to exist in non-human primates40,41,42. Among non-primate animal species, unigenic AO polymorphism has been reported in pig, dog, rat, cow and rabbit43. The molecular mechanism of the porcine AO polymorphism was previously elucidated44,45. A major portion of the structural gene, including the entire coding sequence in the last coding exon, was found missing in O alleles from various pig strains.

Assignment of A/B specificity to individual ABO gene sequences has allowed us to investigate the molecular mechanisms that established AO polymorphism in other species. Two genes are annotated in dog species (with AGG or SGG). The AGG sequence is located in the consensus chromosomal region, but the SGG sequence is located on a different chromosome and seems to be nonfunctional as judged by abnormal gene structure with the last coding exon indel-disrupted. Therefore, AO polymorphism is suspected at the AGG gene locus. The examination of the coding sequence identified two interesting SNPs: rs9240920 [897G->A] and rs9240927 [701delG]. The former is a nonsense mutation (Trp299Ter) and the latter is a frameshift mutation. Therefore, the genes with either of these SNPs may account for some of the O alleles in the dog AO polymorphism.
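The relationship between the nucleotide positions and the codon affected by each dog SNP can be checked with simple arithmetic. The Python sketch below assumes that positions are numbered from the first base of the initiation codon of the coding sequence (an assumption that is consistent with 897G->A corresponding to Trp299Ter); the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_number(cds_position: int) -> int:
    """1-based codon index for a 1-based coding-sequence position."""
    return (cds_position - 1) // 3 + 1


def position_in_codon(cds_position: int) -> int:
    """Position (1, 2 or 3) of the nucleotide within its codon."""
    return (cds_position - 1) % 3 + 1


def substitute(codon: str, pos_in_codon: int, new_base: str) -> str:
    """Return the codon with one base replaced (pos_in_codon is 1-based)."""
    i = pos_in_codon - 1
    return codon[:i] + new_base + codon[i + 1:]


if __name__ == "__main__":
    # 897G->A: third base of codon 299; TGG is the only tryptophan codon.
    mutated = substitute("TGG", position_in_codon(897), "A")
    print(codon_number(897), mutated, mutated in STOP_CODONS)
    # 701delG: a single-base deletion shifts the reading frame from this codon on.
    print(codon_number(701), position_in_codon(701))
```

Under this numbering, position 897 is the third base of codon 299, and replacing the final G of the tryptophan codon TGG with A produces the stop codon TGA. The single-base deletion at position 701 falls in codon 234, so the reading frame would be altered from that codon onward.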

An interesting finding was made when the chromosomal organization surrounding the ABO genes was compared between rat and mouse. The mouse genome is of very high quality and many duplicated regions have been properly resolved. Therefore, it provides a useful control. The gene organizations are similar except that a DNA fragment containing 3 ABO (1 A and 2 B) and several additional genes is present in rat between the ABO and FAM69B genes (Table 3). The genes present specifically in this chromosomal region in the rat genome are shown in bold type. If the insertion occurred at the population level, the genome without the insert may be regarded as an O allele. Alternatively, O alleles may have arisen from the genome with the A gene by deletion/unequal crossover. The cow and rabbit genomes list one (A gene sequence with AGG motif) and four (1 A gene sequence with LGG motif, 1 B gene sequence with IGA and 2 B gene sequences with MGA, in addition to 4 retropseudogene sequences) ABO gene sequences, respectively. The information on the ABO genes in those species is currently fragmentary and the inactivating mechanisms of their O alleles remain to be determined.

A/B allelism should have existed in primate ancestors and later inactivation at population level resulted in ABO polymorphism

Several primates exhibit ABO polymorphism and the repertoire of types is species-dependent40. The inter-species sharing of the ABO polymorphism led Landsteiner and Wiener to conceive the theory of trans-species evolution of polymorphism. In this concept the coalescence time of the most recent common ancestral allele predates the speciation time. We previously determined partial nucleotide sequences of the ABO genes from several primate species and demonstrated that amino acid residues corresponding to codons 266 and 268 of human A/B transferases are conserved in all the species examined, depending on A or B allele32. Later evolutionary analyses led to the hypotheses of trans-species inheritance17,22, convergent gene evolution18,19,20 and a combination of those21. Because the ABO gene inheritance in primates was still controversial46, we revisited the topic for further evaluation, with additional experimental data on sugar specificity and activity of A/B transferases summarized in the code table.

Genome sequences in databases do not cover ABO polymorphism. The human reference and non-reference genes (both with LGG motif) in the Ensembl database represent O and A alleles, respectively. The chimpanzee, gorilla and macaque genes with LGG, MGA and MGA, respectively, represent A, B and B alleles from those species. In all the primate species the chromosomal region containing the ABO gene is similar to the consensus with minor differences (Table 3). The current EMBL-EBI InterPro database hosts 65 non-overlapping ABO protein/peptide sequences, including several proteins with MGG, MGS, or LGA motif.

The phylogenetic trees of primate ABO genes are complex22. However, A and B specificity may be ascribed to amino acid residues corresponding to human codons 266 and 268 and their neighbors, by narrowing down the scanning window. In this investigation we instead evaluated the convergent evolution theory from an enzymological point of view. As shown in Table 2, the A to B conversion of sugar specificity may be achieved not only by the change from LGG to MGA motif, but also by other amino acid substitutions and even by single amino acid substitutions. Note that only one base change may be sufficient for the conversion to FGG, HGG, or YGG motif with B specificity. The B to A conversion is also possible by changing to motifs other than LGG. However, the conversion from MGA to an A-specific motif may need at least 2 nucleotide changes, even for the single amino acid substitution to PGA.
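The number of base changes separating two amino acids can be estimated from the standard genetic code. The Python sketch below computes the minimum over all synonymous codon pairs, which is a lower bound only: it does not use the codons actually present in any ABO allele, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid: str) -> list[str]:
    """All codons encoding the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def min_base_changes(aa_from: str, aa_to: str) -> int:
    """Minimum number of nucleotide substitutions converting aa_from to aa_to."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa_from)
        for c2 in codons_for(aa_to)
    )


if __name__ == "__main__":
    print(min_base_changes("L", "M"))  # codon 266, A -> B (Leu266Met)
    print(min_base_changes("G", "A"))  # codon 268, A -> B (Gly268Ala)
    print(min_base_changes("M", "P"))  # codon 266, MGA -> PGA
```

Under the standard code, Leu to Met and Gly to Ala each require at least one base change, whereas Met to Pro requires at least two, because the single methionine codon ATG differs from every proline codon (CCN) at its first two positions.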

Therefore, it is difficult to assume that the same LGG <-> MGA conversion occurred on so many different occasions during primate evolution. Selection after random mutation(s) does not explain the convergent evolution hypothesis because motifs other than LGG and MGA are also enzymatically functional (see Table 2). Rather, the current distribution may be easily explained by assuming that functional A and B alleles were both present in the common ancestors of primates.

Bacterial ABO genes evolved into 2 separate groups with different sugar specificities through horizontal and vertical gene transmission

In addition to eukaryotes, ABO specificity also exists in prokaryotes, especially in Gram-negative bacteria, which constitute the bulk of intestinal flora47. The first two ABO genes cloned from bacteria are from the O86 strain of Escherichia coli and from Helicobacter mustelae, which express B and A antigens, respectively48,49. Based on an analysis of 19 bacterial genes, horizontal gene transfer between eukaryotes and prokaryotes and among bacteria was proposed to explain the absence of ABO genes in many species of invertebrates, plants and fungi50. Because recent microbial genome sequencing efforts have identified additional bacterial ABO genes, we analyzed 56 bacterial proteins in the EMBL-EBI InterPro database and constructed phylogenetic trees of bacterial ABO genes. A tree is shown in Fig. 3c. In contrast to vertebrate ABO genes, all the bacterial A genes with GalNAc specificity segregated from the B or AB genes with galactose or GalNAc/galactose specificity, respectively. Another important finding is that the bacterial ABO genes show a different range of variation in the amino acid sequence motif from the animal genes. AGG and CGG motifs were found in the A gene sequences, MGS and QGC in the B gene sequences and MGG, QGG and TGA in the AB gene sequences. Whereas many of the motifs found in the bacterial ABO genes are also present in animal genes, the QGG motif seems to be unique to bacteria. The TGA motif was found in animal ABO retropseudogenes as described above. In Bacteroides and Parabacteroides species ABO genes were clustered separately for possible A gene sequences with AGG and possible AB gene sequences with QGG or MGG. In other bacterial species the genes were grouped in either of the two big clusters of A or B/AB genes.

Discussion

What is the evolutionary significance of the ABO gene and its polymorphism? We tackled this question by employing an integrative approach that combined standard phylogenetic techniques with molecular enzymology. Based on gene distribution, we first concluded that the A/B transferase gene appeared after the separation of fish and amphibian lineages. The requirement of A/B transferases for an α1,2-linked fucosylated substrate strongly supports the emergence of α1,2-FT genes before A/B transferase genes. In this context it is noteworthy that coelacanth has a FUT2 gene sequence (although its functionality is questionable) and no ABO gene sequence. However, because the coelacanth genome sequence is preliminary, the possibility remains that an ABO gene may also exist in coelacanth. If this happens to be true, A/B gene appearance may be dated back to the time of lobe-finned fish appearance.

We created a code table correlating amino acid sequence motifs with A/B specificity (Table 2). However, it should be noted that having an active enzyme motif does not guarantee gene function or sugar specificity. Mutation(s) at other position(s) may impair the enzymatic activity51. Care must be taken in interpreting the results because the predicted sugar specificity rests on the assumption that the gene sequences encode functional glycosyltransferases, which is not always the case. A and B gene sequences can be O, depending on their functional context42. Moreover, the table reveals one discordance, concerning the AYVYGS motif. The human A transferase construct containing this motif (in place of FYYLGG) at codons 263–268 did not exhibit A transferase activity, whereas the H. mustelae bacterial gene having this sequence was reported to exhibit A activity. We assume that structural differences in other portions of the bacterial enzyme may have compensated for the difference in activity.
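The motif-to-specificity relationships stated in this article can be expressed as a simple lookup. The sketch below is illustrative only: it contains just the three-residue motifs whose specificity is named in the text (not the full Table 2), the dictionary and the function name predict_specificity are hypothetical, and a match predicts specificity only under the assumption that the rest of the sequence encodes a functional enzyme.

```python
# Motif -> predicted sugar specificity, restricted to pairs named in the text.
# "A" = GalNAc, "B" = galactose, "AB" = dual GalNAc/galactose specificity.
# This is NOT Table 2 of the article; it is a partial, illustrative subset.
MOTIF_SPECIFICITY = {
    "LGG": "A",   # prototypic primate A allele
    "AGG": "A",   # rat and bacterial A genes
    "CGG": "A",   # bacterial A genes
    "MGA": "B",   # prototypic primate B allele
    "MGS": "B",   # bacterial B genes
    "QGC": "B",   # bacterial B genes
    "LGA": "AB",  # human cis-AB
    "MGG": "AB",  # human cis-AB, bacterial AB genes
    "QGG": "AB",  # bacterial AB genes
    "TGA": "AB",  # bacterial AB genes
}


def predict_specificity(motif):
    """Return 'A', 'B', 'AB', or 'unknown' for a three-residue motif.

    The prediction assumes a functional glycosyltransferase; mutations
    elsewhere in the sequence can abolish activity regardless of the motif.
    """
    return MOTIF_SPECIFICITY.get(motif.strip().upper(), "unknown")
```

A motif absent from the lookup, such as the HAA motif conserved in A3GALT2 and GGTA1 genes, is reported as unknown rather than guessed.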

We identified multiple copies of ABO gene sequences in a variety of species (Fig. 1), some of which possess sequence(s) with A-specific motif(s) and sequence(s) with B-specific motif(s) (Table 3). If multiple copies were found in only one species, the possibility would exist that they were erroneously assembled. However, because multiple copies were observed in several different species, it seems unlikely that all of those findings are artifacts. In the case of rats, ABO gene duplication seems to be firmly established33,37. The number of species having both A and B gene sequences is expected to increase as new genome sequencing projects proceed, provided that duplicated regions are properly resolved, which may be somewhat difficult in most NextGen sequencing projects. Irrespective of A/B specificity, phylogenetic analyses grouped those ABO gene sequences into a single cluster that was separated from the clusters of other α1,3-Gal(NAc)T genes (GBGT1, A3GALT2, GGTA1 and GLT6D1) (Fig. 2).

It is evident that animal A and B genes did not evolve into two separate genetic entities. Apparently, evolution suppressed the establishment of independent, functional A and B genes by some mechanism(s). However, proximity in genetic distance does not seem to be responsible for this failed separation, even though A and B gene sequences are situated very closely on a chromosome in some species. GGTA1(-1), GGTA1(-2) and GLT6D1(-1) genes are also closely linked, as are SEC1 and FUT2 genes (Table 1). These genes, however, followed independent evolutionary paths, whereas A and B gene sequences did not. As shown in Fig. 1, the majority of GBGT1, A3GALT2 and GGTA1 genes possess conserved motifs of GGA, HAA and HAA, respectively. This conservation strongly suggests that those motifs are vital to their glycosylation reactions. In contrast, the motif shows some variation in the ABO gene and more in the GLT6D1 gene. A and B genes encode glycosyltransferases with distinct sugar specificity, but both A and B transferases utilize the same H substances. Although this sharing of acceptor substrates may have contributed to the mutual dependence of those two genes to a certain degree, it is not a sufficient explanation because SEC1 and FUT2 genes, which encode α1,2-FTs with similar enzymatic characteristics, still formed separate phylogenetic clusters.

Two modes of appearance and inheritance of A and B gene sequences in a given animal species may be considered to explain the results in Fig. 3a. One is that those sequences with different sugar specificity appeared recurrently by convergent mutations after the separation from the other species analyzed. Another, much more likely, possibility is that those sequences attained species-specific sequence homology through intergenic exchanges after A/B specificity was inherited from common ancestral genes. An examination of gene organization revealed that full genes with initiation and termination codons are rare in those species possessing multiple ABO gene copies. Many are partial genes that are incapable of encoding functional glycosyltransferases by themselves. We speculate that they may serve as a reservoir of genetic diversity for switching A/B specificity through gene conversion, exon shuffling, or recombination. In several species multiple ABO gene sequences are closely linked to one another, which facilitates recombination/gene conversion without genetic catastrophe and produces new possible adaptations at a higher rate than nucleotide substitutions do.

As mentioned above with rats, insertions/deletions/unequal crossovers/gene conversions seem to have occurred frequently at the ABO gene locus. Such events may have reduced the gene number from several to one on certain occasions. Therefore, it is not too far-fetched to hypothesize that differential deletions/crossovers may have resulted in differential outcomes. Starting from tandemly linked A and B gene sequences, A and B alleles may have been created (the multigenic-to-unigenic transition hypothesis). New functional allele(s) may have been generated within partial and nonfunctional sequence(s), provided that changes in gene organization restored their ability to encode active enzymes that are expressed after being inserted or copied into the functional gene(s). An example of such restoration of function (and not merely a change of function) has recently been demonstrated for a human A allele generated by recombination between a functional B allele and a nonfunctional O allele52. Those events may have taken place before simians appeared. Rats and rabbits have A genes with AGG and LGG, respectively. Therefore, prosimians and simians may have inherited an A gene with LGG similar to Lagomorpha genes, rather than Rodentia genes, because no genes with the AGG motif are found in primates22. An alternative explanation would be the unigenic-to-multigenic transition hypothesis: A/B allelism appeared first and then natural selection favored duplication events in many species to separate both alleles, whereas this separation did not occur in primates. This is an interesting hypothesis because it may easily explain the absence of separate evolution of A and B genes. However, it seems less likely because all species other than primates that are known to have unigenic polymorphism exhibit AO, and not AB, polymorphism43.

Based on the relationship between amino acid motifs and A/B specificity, we have shown that A and B alleles with LGG and MGA motifs, respectively, existed in common ancestors of primates. This suggests that they were inherited, most probably, in a trans-species manner. However, the fact that motifs other than LGG and MGA also exist in some primate species indicates that mutations/recombination also occurred, several of which may be the result of convergent evolution. For instance, the LGA motif is found in Ecuadorian squirrel monkeys and humans and MGG is found in Ecuadorian squirrel monkeys, Weeper capuchins and humans, although cases of cis-AB (with the LGA or MGG motif) are rare in humans. These motifs may be derived from either LGG or MGA by point mutation or by recombination of those two alleles, which still supports the inheritance of an ancestral polymorphism with the A allele (LGG) and B allele (MGA) as prototypic alleles. The MGS motif in titi monkeys may have resulted from MGA by a single nucleotide substitution, rather than from LGG by 2 amino acid substitutions.

In addition to primates, many other animal species analyzed also maintain the prevailing motifs of LGG and MGA, although AGG is also frequent in non-primate animals. Considering that additional motifs may also render the ABO gene-encoded proteins enzymatically active, as demonstrated in the code table, those 3 motifs may be considered ancestral for those species. However, to evaluate this possibility, further characterization of additional ABO genes from many other species, including amphibians and reptiles, will be needed. ABO genes seem to have evolved under more or less constant selective pressure for some polymorphism in their catalytic specificity, which in some species is achieved by carrying different gene copies (multigenic polymorphism) and in other species through allelic polymorphism of a single gene (unigenic polymorphism). Whether the latter is limited to primate species needs to be determined in order to conclusively prove or disprove the multigenic-to-unigenic transition hypothesis.

A/B antigen expression depends on the A/B genotype of the individual. Although humans and several other species express A/B antigens on red blood cells, expression on RBCs is relatively rare among species. In contrast, epithelial cells, including those of the gastrointestinal tract, express A/B antigens in many species. Accordingly, the significance of these antigens may be better sought in that cell type. Many cell-surface oligosaccharide structures are involved in microbial interactions and ABH antigens are no exception53. Indeed, ABO polymorphism has been associated with certain infectious diseases54,55,56. The presence/absence of A/B antigens and the concordant absence/presence of anti-A/B antibodies provide strong lines of defense against infection. Having an ABO gene should be beneficial because many vertebrate species maintain this gene. However, having both functional A and B genes ubiquitously within a species might not be so advantageous because individuals may eventually lose anti-A/B antibodies. Rather, frequent conversion of A/B specificity by amino acid substitutions or by recombination with nonfunctional partial genes may have conferred an adaptation against microbial attacks. Different ABO phenotypes in different species and ABO polymorphism within species may inhibit inter-species and intra-species infections, respectively. Our results are consistent with the hypothesis that host organisms attained the variation utilizing those two molecular mechanisms.

We unexpectedly observed that bacterial ABO genes clustered separately into 2 groups with different sugar specificities (A and B/AB genes) (Fig. 3c), as opposed to animal ABO genes, of which A and B genes did not evolve independently. The widespread presence of A/B genes in bacteria47 indicates that ABO mimicry is advantageous to survival. The bacterial ABO genes have been transmitted horizontally to different bacteria and vertically through generations. We reason that these mixed modes of gene inheritance have allowed the segregated evolution of the bacterial ABO genes into 2 groups. Horizontal gene transfer has evidently made it easier for bacteria to adapt to the host defense system. Conversely, interactions with infectious agents may have stimulated host ABO gene evolution, as intra-species polymorphism may help the survival of host species by changing allele frequencies through balancing selection.

In conclusion, the systematic functional analysis correlating amino acid sequence motifs with A/B specificities opened a new avenue for investigating ABO gene and protein evolution. Together with phylogenetic analyses, it has provided insights into the evolutionary significance of the ABO gene and its polymorphism and addressed several important questions.

Methods

Materials

Reagents for PCR, restriction endonucleases, T4 DNA ligase and other enzymes were purchased from Life Technologies (Carlsbad, CA) and New England BioLabs (Ipswich, MA). HeLa cells, human uterine cervical cancer cells, were originally obtained from the American Type Culture Collection (ATCC) and have been maintained in the laboratory for over a decade. Cell culture media, frozen transformation-competent E. coli bacteria and Lipofectamine 2000 were also purchased from Life Technologies. Oligodeoxynucleotides were custom-synthesized by the same company. Anti-A and anti-B murine monoclonal antibody mixtures were from Ortho Diagnostic Systems (Piscataway, NJ), and the Vectastain ABC System and DAB (3,3′-diaminobenzidine) substrate for color development were from Vector Laboratories (Burlingame, CA).

In vitro mutagenesis of human A transferase expression construct

We employed a PCR-mediated in vitro mutagenesis approach as previously described30. Degenerate oligodeoxynucleotides were used to introduce amino acid substitutions at codons 266 and 268 of human A transferase. The primers originally used for library construction were as follows:

FYV7 (T7-F): 5′-TAATACGACTCACTATAGGG

FYV1 (SV40 polyA-R): 5′-GAAATTTGTGATGCTATTGC

IMPPC235 (F): 5′-GGCGATTTCTACTACNNNGGGGSGTTCTTCGGGGGGTC

IMPPC236 (R): 5′-GACCCCCCGAAGAACSCCCCNNNGTAGTAGAAATCGCC

The letters N and S denote a mixture of 4 nucleotides (G/A/T/C) and of 2 nucleotides (G/C), respectively, at those positions. A human A transferase expression construct57 prepared in the pSG-5 vector (Stratagene, La Jolla, CA) was used as the PCR template. Two consecutive rounds of PCR were performed: the first as two separate reactions, one with the FYV7 (T7-F) and IMPPC236 (R) primers and the other with the IMPPC235 (F) and FYV1 (SV40 polyA-R) primers, and the second by mixing both reactions. The PCR products were cleaved with SacII and BamHI restriction enzymes and ligated with the SacII-BamHI vector fragment of the human A transferase expression construct. After transformation of E. coli bacteria, plasmid DNA was prepared from transformant colonies and sequenced, and the constructs containing the intended amino acid substitutions but lacking additional non-synonymous mutations were selected for DNA transfection experiments. For constructs that we failed to obtain using the degenerate oligodeoxynucleotide primers, and for constructs not covered by the library approach, specific primers were designed for individual constructions (not shown).
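The design of the degenerate primer pair can be checked with a short script. The sketch below is not part of the original protocol; it assumes that IMPPC235 (F) begins in frame so that its nucleotides 16 to 24 correspond to codons 266 to 268, and that the standard genetic code applies. The helper names (reverse_complement, expand, translate) are hypothetical.

```python
from itertools import product

# IUPAC codes used in the primers: N = any base, S = G or C.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT", "S": "CG"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N", "S": "S"}

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

FORWARD = "GGCGATTTCTACTACNNNGGGGSGTTCTTCGGGGGGTC"  # IMPPC235 (F)
REVERSE = "GACCCCCCGAAGAACSCCCCNNNGTAGTAGAAATCGCC"  # IMPPC236 (R)


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain N or S."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def expand(seq):
    """List every unambiguous sequence encoded by a degenerate primer."""
    return ["".join(p) for p in product(*(IUPAC[base] for base in seq))]


def translate(seq):
    """Translate an in-frame nucleotide sequence; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


# Assumption: primer is in frame, so slice 15:24 spans codons 266-268.
variants = expand(FORWARD)
motifs = {translate(v[15:24]) for v in variants}
primers_are_complementary = reverse_complement(FORWARD) == REVERSE
```

Under these assumptions the library spans every residue (or a stop) at codon 266, a fixed glycine at codon 267, and glycine or alanine at codon 268, which is consistent with the need for specific primers for constructs outside that range.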

DNA transfection and immunostaining

HeLa cells were used as recipients for DNA transfection. These cells were derived from a type O individual and exhibit cell surface H substances. When functional A/B transferases are expressed by DNA transfection, H substances are converted to A/B antigens. We have used this system on various occasions to examine the specificity and activity of A/B transferase variants30,57,58. DNA transfection experiments were performed using 96-well plates as previously described59. Lipofectamine 2000 reagent was used, following the manufacturer's instructions. DNA from the FUT2 expression construct prepared in pSG-5 and DNA from the pEGFP-N1 vector (GenBank Accession #U55762) were co-transfected: the former to increase the acceptor substrate availability and the latter to calculate the transfection efficiency for activity adjustment. Two days after DNA transfection, GFP-positive cells were counted. The next day, cells were fixed with paraformaldehyde and washed with PBS. After drying, cells were treated first with either anti-A or anti-B monoclonal antibodies, second with biotinylated anti-mouse IgM, then with Avidin/Biotinylated Peroxidase Complex (ABC), followed by color development using DAB substrate. Stained cells were counted microscopically and A/B specificity and activity were determined after adjusting for transfection efficiency using GFP-positive cell counts. Because of variable detachment of cells from the dish substratum during the fixation and immunostaining procedures, data are presented in a semi-quantitative manner.
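The text does not give the formula used to adjust for transfection efficiency. One simple reading, sketched below purely as an assumption, is to express the number of antibody-stained cells per GFP-positive cell in the same well. The function name adjusted_activity and the example counts in the usage note are hypothetical.

```python
def adjusted_activity(stained_cells, gfp_positive_cells):
    """Return antibody-stained cells per GFP-positive cell.

    This ratio is an assumed normalization, not the published procedure.
    A well without GFP-positive cells has no defined efficiency.
    """
    if gfp_positive_cells <= 0:
        raise ValueError("no GFP-positive cells counted; efficiency undefined")
    return stained_cells / gfp_positive_cells
```

For example, a well with 50 stained cells and 200 GFP-positive cells would give a ratio of 0.25. Because cells detach variably during fixation and staining, such ratios are best treated as semi-quantitative.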

Databases, sequence alignment and construction of phylogenetic trees

Nucleotide and amino acid sequences, exon-intron organizations and chromosomal locations of α1,2-FT genes (FUT1/FUT2/SEC1) and α1,3-Gal(NAc)T genes (ABO/GBGT1/A3GALT2/GGTA1/GLT6D1) were retrieved from Ensembl (www.ensembl.org/index.html) and GenBank (www.ncbi.nlm.nih.gov/genbank/) genome sequence databases. Protein/peptide sequences of the ABO genes were retrieved from the EMBL-EBI InterPro database (www.ebi.ac.uk/interpro/).

The Ensembl genome sequence database (release 73) listed 89 annotated α1,2-FT genes with 66 speciation nodes and 15 duplications in the ENSGT00390000001450 gene tree and 255 annotated α1,3-Gal(NAc)T genes with 185 speciation nodes and 65 duplications in the ENSGT00400000022032 gene tree. The phylogenetic tree in Fig. 2 was constructed by the neighbor-joining method60. The JTT model61 was used to estimate the number of amino acid substitutions and 1,000 bootstrap replications were computed using MEGA562. The phylogenetic trees in Fig. 3 were constructed by the maximum likelihood method, using the same software.
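The bootstrap step can be illustrated with a minimal sketch of column resampling. This is not a reimplementation of MEGA5: tree building and the JTT model are omitted, a simple proportion of differing sites (p-distance) stands in for the substitution estimate, and the three toy sequences and all function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def p_distance(seq_a, seq_b):
    """Proportion of aligned positions that differ (gaps not handled)."""
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


# Hypothetical toy alignment; real analyses use full protein alignments.
toy = {
    "seqA": "FYYLGGFFGG",
    "seqB": "FYYMGAFFGG",
    "seqC": "FYYAGGFFGG",
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(1000)]

# Fraction of pseudo-replicates in which seqA is strictly closer to seqC
# than to seqB, a crude analogue of bootstrap support for that grouping.
support = sum(
    p_distance(r["seqA"], r["seqC"]) < p_distance(r["seqA"], r["seqB"])
    for r in replicates
) / len(replicates)
```

In a full analysis each pseudo-replicate alignment would be used to rebuild a tree, and the support for a branch would be the fraction of replicate trees that contain it.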