SMIM1 variants rs1175550 and rs143702418 independently modulate Vel blood group antigen expression

The Vel blood group antigen is expressed on the red blood cells of most individuals. Recently, we described that homozygosity for inactivating mutations in SMIM1 defines the rare Vel-negative phenotype. Still, Vel-positive individuals show great variability in Vel antigen expression, creating a risk for Vel blood typing errors and transfusion reactions. We fine-mapped the regulatory region located in SMIM1 intron 2 in Swedish blood donors, and observed a strong correlation between expression and rs1175550 as well as with a previously unreported tri-nucleotide insertion (rs143702418; C > CGCA). While the two variants are tightly linked in Caucasians, we separated their effects in African Americans, and found that rs1175550G and to a lesser extent rs143702418C independently increase SMIM1 and Vel antigen expression. Gel shift and luciferase assays indicate that both variants are transcriptionally active, and we identified binding of the transcription factor TAL1 as a potential mediator of the increased expression associated with rs1175550G. Our results provide insight into the regulatory logic of Vel antigen expression, and extend the set of markers for genetic Vel blood group typing.

Recently, we and two other research groups identified a previously unreported erythroid gene, Small Integral Membrane Protein 1 (SMIM1), as the locus of the Vel blood group system [1][2][3]. We showed that individuals who lack the Vel blood group antigen (about 1/1,200 in Sweden and 1/4,000 globally) are homozygous for an inactivating 17-bp deletion in SMIM1 exon 3.
While these findings provide a molecular background for the Vel-negative (Vel−) phenotype, they do not explain the recurrent clinical observation that Vel-positive (Vel+) individuals show great variability in antigen expression. Weakly Vel+ erythrocytes (Vel+weak) can be mistyped as Vel− by routine serological phenotyping, creating a risk of transfusion reactions in patients with anti-Vel who are inadvertently transfused with Vel+weak erythrocytes. While some of the variation can be attributed to heterozygosity for the 17-bp deletion [1], this mutation does not account for all of it. The common SNP rs1175550 (A > G; minor allele frequency, MAF = 0.22 in Caucasians), located in intron 2 of SMIM1, associates with mean corpuscular hemoglobin concentration (MCHC) [4], and individuals who carry the minor allele (rs1175550G) express SMIM1 at a higher level than individuals who carry the major allele (rs1175550A) [2][5][6]. However, it is not known whether rs1175550 itself is causal or a proxy for another, as yet undetected, causal variant. Moreover, neither the molecular effects of this variant nor the possible contribution of additional variants to SMIM1 expression has been defined.
To address these questions, we fine-mapped a candidate regulatory region in SMIM1 intron 2, containing rs1175550 and multiple erythroid transcription factor binding sites, in Swedish and African American blood donors. We identified rs1175550 and a previously unreported tri-nucleotide insertion (rs143702418; C > CGCA) as correlated with SMIM1 and Vel antigen expression. While these variants, and their effects on expression, were inseparable in the Swedish samples, their effects could be separated in the African American samples, where the linkage disequilibrium in SMIM1 intron 2 was not as tight. Our data show that rs1175550 and rs143702418 independently influence SMIM1 and Vel antigen expression.

Results
rs1175550 is located in a regulatory region in SMIM1 intron 2.
To understand the role of rs1175550, we examined its neighborhood in SMIM1 intron 2 for transcription factor binding sites using ChIP-seq data in the Encyclopedia of DNA Elements compendium (ENCODE) [7]. This revealed a 500 bp region with increased acetylation of lysine 27 on histone 3, increased DNaseI hypersensitivity, and binding sites for multiple transcription factors, including the erythroid factors GATA-1, TAL1 and ZBTB7A. The region contained eight sequence variants annotated with a MAF of more than 1% in dbSNP 144 [8] (Fig. 1a).

rs1175550 and rs143702418 associate with SMIM1 and Vel expression.
To test for correlations with SMIM1 and Vel expression, we sequenced the identified region in 150 Vel+ Swedish blood donors. In the same individuals, we quantified SMIM1 and Vel antigen expression by quantitative PCR and flow cytometry. We detected all eight candidate sequence variants at frequencies similar to those reported in the 1000 Genomes catalog, phase 3 [9], which contains fully sequenced genomes from 2,504 individuals grouped into five superpopulations (African, American, East Asian, European and South Asian) (Table 1).
The rs1175550 variant was strongly associated with SMIM1 mRNA levels (p = 2.1 × 10^−8), Vel antigen expression (p = 4.0 × 10^−15) and SMIM1 protein on erythrocyte membranes, with the G allele correlated with higher expression (Fig. 1b-d). Interestingly, we found almost identical allele frequencies and correlation values for a previously unreported C > CGCA insertion, rs143702418 (rs70940313 in reverse complement), located only 96 base pairs upstream of rs1175550 (Table 1, Supplementary Figs S1 and S2). We found no significant correlations for the remaining six variants (Supplementary Figs S1 and S2). The two variants showed near-perfect linkage disequilibrium (LD), with rs1175550G being linked to rs143702418CGCA in 149 out of 150 samples (Table 1 and Supplementary Data S1). Furthermore, in the three predominant SMIM1 intron 2 alleles in Europeans in the 1000 Genomes project (comprising 96% of all identified alleles), the rs1175550G/rs143702418CGCA and rs1175550A/rs143702418C genotypes were always linked (Supplementary Fig. S3). Thus, both rs1175550 and rs143702418 associate with SMIM1 and Vel antigen expression, yet the strong LD precluded statistical separation of their effects using samples from individuals of European ancestry.

rs1175550 and rs143702418 independently influence SMIM1 and Vel antigen expression.
Hypothesising that the LD between rs1175550 and rs143702418 might be different in other populations, we examined the repertoire of SMIM1 intron 2 alleles in the 1000 Genomes catalog. We observed 32 different alleles, although nine of these were only present in a single individual in 1000 Genomes, and could therefore represent sequencing artifacts. Analysis of alleles detected in at least two individuals showed that these cluster mainly by rs1175550 genotype, indicating that this variant appeared evolutionarily before the other variants in the region (Supplementary Fig. S4).
Analysing the 1000 Genomes data, we noted a difference in genotype frequency between the two variants in the African super population (allele frequency 0.60 and 0.26 for rs1175550G and rs143702418CGCA, respectively), meaning the two variants are not in tight LD: the frequency of the unlinked rs1175550G/rs143702418C allele was 0.26 in African Americans, compared to 0.01 in Europeans (Fig. 2a). To deconvolve the effects of rs1175550 and rs143702418, and thereby assess their causality, we analysed samples from 202 African American blood donors. These samples showed a broader repertoire of SMIM1 intron 2 alleles when compared to the Swedish samples, including alleles where the linkage between rs1175550 and rs143702418 was broken (Supplementary Data S2).
Genotype frequencies in the African American sample set matched those in the 1000 Genomes catalog (Table 1). In this sample set, we found rs1175550 to be associated with SMIM1 mRNA and Vel antigen expression (p = 1.0 × 10^−15 for Vel antigen expression; p = 7.0 × 10^−8 for SMIM1 mRNA), whereas rs143702418 showed weaker association (p = 0.0002112 for SMIM1 mRNA; p = 0.0103 for Vel antigen expression) (Fig. 2b-e). None of the other candidate variants showed any significant correlation (Supplementary Figs S5 and S6).
To test if the association with rs143702418 represents an independent effect, we used multiple linear regression including both rs1175550 and rs143702418 in the model. We found that samples heterozygous for rs143702418 had significantly lower Vel antigen expression (effect estimate, p = 0.00398) compared to homozygosity for rs143702418C (Fig. 3a), and observed a trend towards further decrease for samples homozygous for rs143702418CGCA (effect estimate, p = 0.08939). This was consistent with an increase in the model R² as compared to a model including rs1175550 alone (adjusted R² = 0.5237 vs. 0.5064 for rs1175550 alone; p < 2.2 × 10^−16 for model), indicating that rs143702418 has a small but independent effect (Fig. 3a, Supplementary Table S1). This trend was mirrored by the mRNA expression levels, although it was not statistically significant, possibly due to the limited sample size (Fig. 3b). No independent effects were observed for the other six candidate SNPs (Supplementary Table S1).

rs1175550 and rs143702418 are transcriptionally active.
Sequence analysis showed that rs1175550 and rs143702418 are located at predicted binding sites for erythroid transcription factors. For rs1175550, the major allele predicts a non-canonical GATA-1 site, GATT, modified by the minor allele to GGTT. For rs143702418, we observed that the major allele comprises a core motif for KLF1 (CACCC), which is modified by the minor allele to CACGCACC.
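The two motif changes quoted above can be reproduced with simple string operations. The following is a minimal sketch only: the positions within each motif are inferred solely from the motif strings given in the text (not from genomic coordinates or strand annotation), and the helper name `apply_variant` is hypothetical.

```python
def apply_variant(motif, index, ref, alt):
    """Replace the reference allele found at `index` in `motif` with `alt`."""
    if motif[index:index + len(ref)] != ref:
        raise ValueError("reference allele does not match motif at this position")
    return motif[:index] + alt + motif[index + len(ref):]

# rs1175550 (A > G): the GATT site becomes GGTT when the A at position 1 changes
gata_minor = apply_variant("GATT", 1, "A", "G")

# rs143702418 (C > CGCA): try every C of the KLF1 core motif as the affected base
klf1_core = "CACCC"
candidates = {
    i: apply_variant(klf1_core, i, "C", "CGCA")
    for i, base in enumerate(klf1_core)
    if base == "C"
}
# Keep only the positions that yield the minor-allele sequence quoted in the text
matching = [i for i, seq in candidates.items() if seq == "CACGCACC"]

print(gata_minor)  # GGTT
print(matching)    # [2]
```

Under these assumptions, only the third base of CACCC reproduces CACGCACC, which places the insertion inside the core motif rather than at its edge.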
For functional validation of rs1175550 and rs143702418, we made reporter constructs corresponding to the four theoretical alleles containing the combinations of these two biallelic variants (Fig. 4a). Consistent with our association data, luciferase assays in K562 and HEL erythroleukemia cells showed higher activity with both rs1175550G and rs143702418C (Fig. 4b), and both constructs with rs143702418C showed higher activity than those with rs143702418CGCA (K562 p = 0.000683; HEL p = 0.002703). The highest luciferase activity was found for the construct also carrying rs1175550G (p = 0.03448 in the HEL cells). Gel shift assays with nuclear extracts from K562 cells showed distinct binding patterns with probes mapping to the minor and major alleles for both rs143702418 and rs1175550 (Fig. 5a,b), most notably a distinct, shifted band only seen with the rs1175550G probe (Fig. 5b, lane 4). These results further support that the two variants independently modulate SMIM1 and Vel antigen expression, although rs1175550 has a more powerful effect.
Identification of TAL1 as a candidate factor underlying the increased SMIM1 expression associated with rs1175550G.
Based on ChIP-seq data from ENCODE (Fig. 1a), we identified GATA-1, KLF1, TAL1 and ZBTB7A as candidate factors for regulating SMIM1 expression. These factors have been previously identified as erythroid, and their temporal expression profiles in erythrocyte development follow that of SMIM1 (Supplementary Fig. S7) [10]. We also included Gfi-1B, as its binding motif AATC (reverse-complement GATT) matched the sequence at rs1175550, and it has been shown to associate with GATA-1 and mediate transcriptional repression [11][12][13]. We carried out gel shift analyses with antibodies to the selected transcription factors. Firstly, we observed a supershift with anti-GATA-1 for rs1175550 (Fig. 5b, lanes 3,6). However, this signal was not allele-specific, and additional analyses with probes mutated at the predicted non-canonical GATA-1 site at rs1175550 yielded similar results, indicating that GATA-1 does not bind directly to rs1175550 (Fig. 5b, lanes 8,10). Secondly, we observed that the strong band seen with both rs1175550A and rs1175550G probes was weakened with anti-Gfi-1B, whereas the allele-specific band observed only with the rs1175550G probe remained unaffected (Fig. 5a, lanes 3,9). These data indicate that GATA-1 and Gfi-1B bind near rs1175550 (directly or indirectly) regardless of genotype, making it unlikely that the increased expression associated with rs1175550G is explained by altered binding of GATA-1 or Gfi-1B.
In contrast to the non-allele-specific reactions seen with anti-GATA-1 and anti-Gfi-1B, we achieved suppression of the rs1175550G-specific signal with anti-TAL1 (Fig. 6a, lane 10). No similar effects were seen with anti-Gfi-1B, anti-ZBTB7A or anti-KLF1 (Fig. 6a), while the TAL1 effect was dose-dependent (Fig. 6b,c). This is particularly interesting since only three bp upstream of rs1175550 is a near-perfect match for the E-box motif CAGNTG, which is a known binding site for TAL1 in heterodimer complexes with E12 and E47 [14]. Finally, we observed no allele-specific supershift or suppression for rs143702418 using these antibodies (Fig. 5a). In summary, our results suggest that differential binding of TAL1, or a TAL1-containing complex, could mediate the increase in SMIM1 and Vel antigen expression associated with rs1175550G.

Discussion
The variation in antigen expression among Vel+ individuals is clinically important, as it may lead to erroneous typing of Vel+ blood as Vel− with anti-Vel sera [15][16]. While rs1175550 zygosity may explain some of this variation, the causality of this sequence variant has been unclear. Fine-mapping the genomic neighbourhood of rs1175550, a regulatory region in SMIM1 intron 2, we discovered that the previously unreported trinucleotide insertion rs143702418 also correlates with SMIM1 and Vel antigen expression. While the effects of rs1175550 and rs143702418 were inseparable in Swedish samples, deconvolution in African American samples revealed that rs1175550 and rs143702418 both modulate Vel expression, although rs1175550 has the strongest effect. Luciferase and gel shift assays supported that both variants are transcriptionally active.
Although the exact mechanisms remain to be elucidated, we identified increased binding of TAL1 as a potential explanation for the increased SMIM1 expression associated with rs1175550G. This is in concordance with recently published data suggesting that TAL1 preferentially binds to rs1175550G [17]. Since the discussed TAL1 and GATA-1/Gfi-1B binding motifs are only three base pairs apart, one could speculate that steric hindrance does not allow these factors to bind simultaneously and that the A > G substitution favours TAL1 binding.
In conclusion, our findings provide novel insight into the regulation of SMIM1 and the Vel blood group antigen, and provide further reason to take rs1175550 and rs143702418 into account when evaluating the correlation between Vel blood group phenotype and genotype. Insight into what governs blood group expression levels on erythroid cells can be important for our understanding of host-pathogen interactions, since many blood group molecules serve as involuntary receptors for microbial agents and may therefore act as susceptibility markers for disease. Even if no such role has yet been proven for SMIM1, it has been hypothesised to be a long-sought malaria receptor [1].

Recently, trans-ancestry association analysis has been proposed as a way to refine associations between genetic variants, quantitative traits and human diseases, yet only a few examples have been published so far [18][19][20].
Here we exploited pre-existing population-based data sets and trans-ancestry association analysis to deconvolve highly correlated effects. We predict that these approaches will be increasingly important for identifying the molecular-genetic effects of GWAS loci.

Methods
Blood samples. We used anonymised, peripheral blood samples from 150 Swedish blood donors (routine donations at Clinical Immunology and Transfusion Medicine, Lund, Sweden), and 202 self-declared African-American blood donors (routine donations at New York Blood Center, New York, NY, USA). No donors were approached solely for the purpose of this study. Genomic DNA and total RNA were isolated using standard methods. Samples were screened for the 17-bp deletion in SMIM1 exon 3 [1], and carriers were excluded.
Reverse transcription quantitative PCR. Total RNA was extracted by Trizol LS purification (Ambion) and converted to cDNA using High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems). Quantitative PCR was carried out on cDNA using an assay mix specific for SMIM1 (Applied Biosystems assay no. Hs01369635_g1). SMIM1 expression was quantified as 2^−ΔΔCT against the sample having the highest expression value. For normalisation across multiple qPCR runs, these samples were included in each experiment.
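The 2^−ΔΔCT quantification described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used in the study: the reference-gene CT values are an assumption (no reference gene is named in the text), the function name `relative_expression` is hypothetical, and all CT values are invented for the example.

```python
def relative_expression(ct_target, ct_reference):
    """Return 2^-ddCT per sample, calibrated to the highest-expressing sample."""
    # dCT: target-gene CT minus reference-gene CT (reference gene is an assumption)
    delta_ct = {s: ct_target[s] - ct_reference[s] for s in ct_target}
    # The highest-expressing sample has the lowest dCT and serves as calibrator
    calibrator = min(delta_ct.values())
    # ddCT = dCT(sample) - dCT(calibrator); relative expression = 2^-ddCT
    return {s: 2.0 ** -(d - calibrator) for s, d in delta_ct.items()}

# Hypothetical CT values, for illustration only
ct_smim1 = {"donor_a": 24.0, "donor_b": 25.0, "donor_c": 26.0}
ct_ref = {"donor_a": 18.0, "donor_b": 18.0, "donor_c": 18.0}

rel = relative_expression(ct_smim1, ct_ref)
print(rel)  # {'donor_a': 1.0, 'donor_b': 0.5, 'donor_c': 0.25}
```

With this calibration the highest-expressing sample is assigned 1.0 and every other sample a fraction of it, so each additional cycle of ΔCT halves the reported expression.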
K562 and HEL cells were cultured at 37 °C and 5% CO2 in RPMI 1640 medium (Gibco, Life Technologies) supplemented with 10% fetal bovine serum (Gibco). A total of 2 × 10^6 cells were mixed with 2 μg construct and 0.2 μg Renilla luciferase construct and electroporated at 960 μFD and 280 V. Following incubation for 24 hours, dual luciferase assays were performed according to the manufacturer's protocol (Dual-Luciferase Reporter Assay System, Promega). All experiments were performed in triplicate.

Statistical analysis.
Kruskal-Wallis one-way analysis of variance with Dunn's multiple comparison tests as well as multiple linear regression models were used to test for association between genotypes and SMIM1 (RT-qPCR) or Vel antigen expression (flow cytometry). In a first analysis, we included only rs1175550 as the explanatory variable. To assess the effect of variants conditioned on rs1175550, we carried out additional analyses using bivariate models with each variant together with rs1175550. For the phylogenetic analysis, we used phased genotyping data from The 1000 Genomes Project, phase 3, for the studied region in SMIM1. To reconstruct the phylogeny, we calculated the Euclidean distance between the different alleles, and clustered the alleles using the neighbour-joining method [23]. R version 3.2.2, Graphpad Prism version 7.0a and Microsoft Excel were used for data analyses.
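The conditional analysis described above, in which a model with rs1175550 alone is compared to a bivariate model, can be illustrated with a small ordinary least squares sketch. The analyses in the study were run in R; the Python code below is only an illustration. The genotype coding (separate heterozygous and homozygous indicators), the helper names, and the simulated data with their effect sizes are all assumptions made for the example.

```python
import random

def dummy_code(genotypes):
    """Code minor-allele counts (0, 1, 2) as heterozygous / homozygous indicators."""
    return [[1.0 if g == 1 else 0.0, 1.0 if g == 2 else 0.0] for g in genotypes]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def adjusted_r2(predictors, y):
    """Fit ordinary least squares with an intercept and return the adjusted R^2."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])  # number of coefficients, including the intercept
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    n = len(y)
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - k)

# Simulated genotypes and expression values, for illustration only
random.seed(0)
snp = [random.choice([0, 1, 2]) for _ in range(200)]  # hypothetical G-allele counts
ins = [random.choice([0, 1, 2]) for _ in range(200)]  # hypothetical CGCA-allele counts
expr = [1.0 * s - 0.3 * i + random.gauss(0.0, 0.5) for s, i in zip(snp, ins)]

snp_only = dummy_code(snp)
both = [a + b for a, b in zip(dummy_code(snp), dummy_code(ins))]
r2_snp_only = adjusted_r2(snp_only, expr)
r2_both = adjusted_r2(both, expr)
print(round(r2_snp_only, 3), round(r2_both, 3))
```

An increase in adjusted R² when the second variant is added indicates that it explains variance beyond the first, which is the logic behind the comparison of 0.5237 and 0.5064 reported in the Results.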