HLA-G genetic diversity and evolutive aspects in worldwide populations

HLA-G is a promiscuous immune checkpoint molecule. The HLA-G gene presents substantial nucleotide variability in its regulatory regions. However, it encodes a limited number of proteins compared to classical HLA class I genes. We characterized the HLA-G genetic variability in 4640 individuals from 88 different population samples across the globe by using a state-of-the-art method to characterize polymorphisms and haplotypes from high-coverage next-generation sequencing data. We also provide insights regarding the HLA-G genetic diversity and a resource for future studies evaluating HLA-G polymorphisms in different populations and association studies. Despite the great haplotype variability, we demonstrated that: (1) most of the HLA-G polymorphisms are in introns and regulatory sequences, and these are the sites with evidence of balancing selection, (2) linkage disequilibrium is high throughout the gene, extending up to HLA-A, (3) there are few proteins frequently observed in worldwide populations, with lack of variation in residues associated with major HLA-G biological properties (dimer formation, interaction with leukocyte receptors). These observations corroborate the role of HLA-G as an immune checkpoint molecule rather than as an antigen-presenting molecule. Understanding HLA-G variability across populations is relevant for disease association and functional studies.


Samples and methods used to characterize HLA-G genetic diversity
We evaluated 4738 individuals from 88 different population samples from four datasets, all of them sequenced in high-coverage (> 30×) and processed through the same computational pipeline. The first dataset was the new high-coverage sequencing data of the 1000 Genomes consortium 36 , the second consisted of 831 samples from the International Genome Sample Resource (IGSR) consortium 37 , the third encompassed 1323 elderly Brazilians from the Saúde, Bem-estar e Envelhecimento (SABE) cohort 38 , and the fourth comprised the high coverage sequencing of 80 individuals from Southeast Asia and Oceania, including the Agta hunter-gatherer people from the Northern Luzon island, Philippines. The population samples and their sample sizes are in Table S1. In all cases, we started from a genome-wide BAM file with reads aligned to the reference hg38 genome (version hg38DH) by using BWA MEM (https:// sourc eforge. net/ proje cts/ bio-bwa/).
The method used to extract the HLA-G data is described in detail in a file (the supplementary methods) with a description of all methods and variables we used. This method also follows the same strategy used to characterize HLA class I genes in Brazilian samples 38 . Using a single pipeline to extract the data from all samples is essential to avoid biases in the genotyping and haplotyping procedures. In brief, we used the hla-mapper version 4.0 to optimize read alignment in the HLA complex, minimizing alignment errors and cross-alignments that are commonly observed among HLA genes 39 . We have called genotypes using GATK HaplotypeCaller in the GVCF mode (https:// gatk. broad insti tute. org) with a further refinement step using vcfx (www. caste lli-lab. net/ apps/ vcfx). This method allows the detection of any previously unknown HLA-G variant. The haplotype inference used both read-aware phasing (GATK ReadBackedPhasing) and probabilistic models with the workflow phasex/shapeit4 (www. caste lli-lab. net/ apps/ phasex). Please refer to the file (supplementary methods) for details regarding the www.nature.com/scientificreports/ mapping, genotyping, and haplotyping procedures. Applying the bioinformatic pipeline, 4640 samples (97.9%) passed the quality control and the haplotyping procedure, as presented in Table S1.
To support the quality of the data presented here, we evaluated departures of the Hardy-Weinberg expectations (HWE) at the SNP level in population samples with at least 10 individuals (63 populations out of 88). Of those, 55 (87.3%) present no HWE deviation, and 8 populations present just two deviations. Thus, most SNPs fit HWE.
Resources used to evaluate the HLA diversity. The method described above allowed us to retrieve SNP and haplotype data from approximately 7870 nucleotides surrounding the HLA-G gene, from 4328 bases upstream of the first translated ATG to 100 nucleotides downstream HLA-G. This large region includes most of the regulatory elements, all exons and all introns. We extracted coding (exons + introns), exonic (the CDS), promoter, and 3'UTR sequences. We also translated CDS sequences into proteins (the allotypes).
In the following sections, we will discuss the HLA-G polymorphisms across the world, providing insights regarding the HLA-G genetic diversity, and present a resource for future studies evaluating HLA-G polymorphisms and disease association studies in different populations. We also provide several useful tools to evaluate HLA-G diversity ( Table 1) that can be downloaded from the website www. caste lli-lab. net/ HLA-G or are included as supplementary material.

Nomenclature, SNPs, nucleotide diversity, and linkage disequilibrium across HLA-G
Due to the lack of a standardized nomenclature for the regulatory regions, individual regulatory region haplotypes have been previously designated as "distal/proximal promoter regions", and "3'UTR haplotypes", as proposed by our group 29,40,41 . Because the HLA-G region evaluated in this study includes an extensae region (from − 4328 upstream the first translated ATG to 100 nucleotides downstream the transcription end site), and there is no designated nomenclature to nominate the sequences we have characterized (for instance, when the sequence includes the complete promoter, all introns, and the complete 3'UTR sequence), we used the term GEMBIO_HLA-G_H1 (complete haplotype number 1, for HLA-G, available in the GeMBio laboratory Database) to designate this large gene sequence, which is available for download (Table 1). These names can be converted to whatever future nomenclature will be available and standardized. Whenever this complete nomenclature is used in this text, the previous names of the regulatory (Distal-G, PROMO-G, and 3'UTR) and coding region (as defined by the IPD-IMGT/HLA database) haplotypes are also mentioned.
We detected 442 different variants encompassing approximately 7870 nucleotides surrounding the HLA-G locus. The majority are bi-allelic single nucleotide exchanges, but there are some indels, multi-allelic variants and combinations of them. The list of each variant, with its reference and alternative alleles, and their global  (Table S2) Genomic alleles (4-digit resolution) frequencies This table provides the frequency for each genomic allele (4-digit resolution) in each population we have studied, in XLSX format. There are 3 sheets, one for biogeographic regions, one for countries, and one for specific population samples Supplementary material (Table S3) Allotypes frequencies (2-digit resolution) This table provides the frequency for each allotype (2-digit resolution, full-length protein) in each population we have studied, in XLSX format. There are 3 sheets, one for biogeographic regions, one for countries, and one for specific population samples Supplementary material (Table S4) 3

'UTR frequencies
This www.nature.com/scientificreports/ frequencies, is available in Table S2. The VCF containing the phased genotypes of every individual is available upon request. These 442 variants are arranged into 514 different haplotypes or full-length sequences ( Table 1) and some of these haplotypes are quite common. The two most frequent ones are: (1) GEMBIO_HLA-G_H1 presents a frequency of 21.6% and is associated with known haplotypes, such as the distal promoter Distal-010101a, the proximal promoter PROMO-G010101a, the coding allele G*01:01:01:01, and the 3'UTR haplotype known as UTR-01, and (2) GEMBIO_HLA-G_H2 presents a frequency of 10.8% and carries the distal promoter Distal-010102a, the proximal promoter PROMO-G010102a, the coding allele G*01:01:02:01, and UTR-02 29,30 . We observed 20 full-length haplotypes presenting global frequencies higher than 1%, exhibiting a summed frequency of 76.85% (from GEMBIO_HLA-G_1 to GEMBIO_HLA-G_20). Thus, despite the presence of many variants, there are few full-length frequent haplotypes, reflecting the high Linkage Disequilibrium (LD) along the HLA-G locus. In Fig. 1, we present the LD pattern along the HLA-G locus for bi-allelic variants with a minor allele frequency (MAF) > 0.1% and fitting HWE (106 variants).
The entire HLA-G gene is included within a single segregation block. Although Fig. 1 illustrates the global LD pattern, we observed similar patterns when considering only samples from Europe, Africa, Asia, and American. Most of the pairwise comparisons present D' = 1 ( Fig. 1 and S1), and many comparisons present r 2 > 0.8. We demonstrate that this high LD extends at least 4 kb upstream HLA-G to 100 nucleotides downstream. Moreover, there are many variants in complete LD (r 2 = 1.00). Of the 106 variants plotted in Fig. 1, 43 (40.6%) present another variant in complete LD. For instance, rs17875397 (position − 56) is in absolute LD with nine other variants, including rs17875400 (+ 507) and rs17875405 (+ 1534). There are also many variants in almost complete LD (measured by r 2 > 0.95). For instance, considering the threshold r 2 > 0.95, 55 selected SNPs can tag all 106 frequent variants along HLA-G. Because of that, LD among HLA-G variants must be considered when performing any association study to disregard hitchhiking effects. This high LD explains the presence of a few haplotypes that are frequent globally when analyzing almost 8 Kb surrounding HLA-G. This high LD was previously proposed evaluating other population samples 29,40 , and it may extend at least more 20 Kb downstream the gene, where an Alu insertion accompanies the G*01:01:01:01 allele 42 , or even extend up the HLA-A locus 43 . Figure 2 illustrates the frequency of each HLA-G variant (panel A), nucleotide diversity (panel B), SNP density (Panel C), and Tajima's D (panel D) across the HLA-G region, in windows of 500 nucleotides and step size of 1, starting from 4 kb upstream HLA-G (the promoter region) to 200 nucleotides downstream it, considering all 1000Genomes samples pooled together. To evaluate the significance of the parameters estimated from HLA-G, we built a null distribution considering the patterns observed in chromosome 6. For that purpose, we computed these statistics in 10,000 random windows of 500 nucleotides from chromosome 6. The values above the blue and red horizontal lines are higher than 99.9% and 99% of the ones observed in the null distribution for chromosome 6, respectively. The orange line represents the average observed for each statistic. Similar patterns can www.nature.com/scientificreports/ be observed in all biogeographic regions. This analysis has revealed interesting insights. Nucleotide diversity across HLA-G is usually higher than 0.4% (the red line), which is much higher than the average observed from chromosome 6 (0.082%). The regions with the highest nucleotide diversity (around 1%), which are among the 0.1% highest observed in chromosome 6 (blue line), coincides with the promoter region around positions − 3614 (green dot, rs1611163) and − 725 (yellow dot, rs1233334). Another very polymorphic region coincides with intron 2. Likewise, the strongest positive Tajima's D signals in the promoter region, which is compatible with balancing selection, coincide with these two markers. The region surrounding marker − 3614 (rs1611163) presents many frequent variants, including rs1611164, rs1611165, rs1611166, rs1611167, rs2743926, rs1611168, rs11169, in a short segment of 336 bp. Likewise, the region surrounding maker − 725 presents many frequent variants, including rs2249863 (− 716), rs2735022 (− 689), rs35674592 (− 666), rs17875391 (− 646), and rs1632944 (− 633), all previously described elsewhere 40 . The position − 725 is a tri-allelic variant, which has been described as influencing HLA-G expression levels possibly by epigenetic mechanisms 44 . Moreover, except for rs17875391, all these polymorphisms are listed as eQTLs for HLA-G expression by the GTEX portal (https:// gtexp ortal. org). It is not clear whether these variants directly influence HLA-G expression or if they are in LD with other regulatory elements, such as enhancer L, located 12 Kb upstream HLA-G 26 . However, the strongest signal of balancing selection coincides with intron 5 (Fig. 2, panel D), with the Tajima's D values higher than 99.9% of the ones observed for chromosome 6. When we consider the transcribed HLA-G segment, illustrated in the bottom panel at Fig. 2, and the positions and frequencies of variants across HLA-G (panel A), we noticed that most frequent variants coincide with intronic regions and that exon 4 is quite conserved. Table 2 presents the number of segregation sites, nucleotide diversity, and Tajima's D in each HLA-G exon and intron. Introns 1 and 2 present two of the highest nucleotide diversity indexes across HLA-G ( Table 2, Fig. 2) and the highest number of segregation sites in the sliding window analysis (Fig. 2). Some of the intronic SNPs are also considered eQTLs for HLA-G expression, according to GTEX data (https:// gtexp ortal. org). It is not clear whether these intronic variants directly influence HLA-G expression. However, as addressed throughout the following sections, these intronic variants (as haplotypes) are in almost complete LD with some promoter and 3'UTR haplotypes.
On the other hand, the nucleotide diversity drops to low levels in exons, except for exon 1, which is very short ( Table 2) and presents two frequent synonymous variable sites. Additionally, the highest Tajima' D coincides with introns, mostly intron 1, intron 2, and intron 5, all presenting positive Tajima's D, which is compatible with balancing selection. In contrast, all exons present negative values, which is compatible with purifying selection ( Table 2). Exon segments are conserved, and most of the exonic variants are synonymous mutations. Exon 2 presents only two frequent variants, one non-synonymous associated with the allotype G*01:03, and the other a synonymous mutation. Exon 3 presents four common variants, two synonymous mutations, one non-synonymous associated with allotype G*01:04, and one frameshift associated with allotype G*01:05. Exon 4 presents To evaluate the significance of the parameters estimated from HLA-G, we built a null distribution considering the patterns observed in chromosome 6, computing these statistics in 10,000 random windows of 500 nucleotides from chromosome 6. The values above the blue and red horizontal lines are higher than 99.9% and 99% of the ones observed for chromosome 6, respectively. The orange line represents the average observed in chromosome 6 for each statistic. The green and the yellow dots represent polymorphisms -3614 (rs1611163) and − 725 (rs1233334), respectively. www.nature.com/scientificreports/ only three frequent variants, and one configures a non-synonymous mutation associated with allotype G*01:06. Other rare variants in exons are usually synonymous mutations. We also calculated the ratio of substitution rates at non-synonymous and synonymous sites in exons (Table 3). Interestingly, none of the exons, when isolated, presented evidence of purifying selection. However, when pooled together, there is evidence of purifying selection at exons 2, 3, and 4 ( Table 3). This phenomenon was observed in a previous study that evaluated a reduced number of sequences 45 . Taken together, the nucleotide diversity and Tajima's D patterns ( Fig. 2 and Table 2), and the ratio of substitutions (Table 3), made it clear that HLA-G is indeed conserved in exons, although not necessarily conserved in introns and regulatory sequences. Nevertheless, nucleotide diversity for introns and the regulatory sequences are much higher than the human average (0.075%) 46 .
Most of the HLA-G common SNPs are shared worldwide. Because of that, there is a low population differentiation (Fig. 3), except for very isolated populations that may have experienced strong bottleneck effects, such as the Aeta, Batak, and Pima.
The HLA-G coding region diversity. We considered the diversity observed between position −300 and + 838, which is the region tracked by the IPD-IMGT/HLA database 19 , and included, in this survey, 191 variation sites. All coding sequences we have observed, together with their names and global counts, are available for download (Table 1). We detected 190 different HLA-G alleles; 36 of them were already reported (fully or partially) in the IPD-IMGT/HLA database 19 , with a summed frequency of 95.54%. The 154 remaining sequences configure possible new HLA-G alleles. Because these new sequences have not been cloned and submitted to the IPD-IMGT/HLA, we named all new sequences as GEMBIO_HLA-G_G followed by a number that identifies this sequence in the GeMBio laboratory. Interestingly, at least 41 of these new alleles occurred more than once, and some are very frequent, such as GEMBIO_HLA-G_G144, with 51 copies detected in Africa, America, and the Middle East. We also confirmed the existence of many rare HLA-G alleles that have been submitted to the IPD-IMGT/HLA database, such as G*01:01:01:14Q (4 copies), G*01:09 (1 copy), G*01:11 (11 copies), G*01:14 (6 copies), G*01:21 N (2 copies), G*01:26 (2 copies), and others (Table S3).  null allele), and G*01:06:01:01 (encoding the G*01:06 allotype). These alleles represent 91% of all HLA-G sequences detected worldwide, and their frequencies differ in each biogeographic region (Fig. 4), and in each population sample (Table S3). For instance, the frequency of G*01:01:01:01 is high in East Asia, low in Africa, and rare in Oceania, while G*01:01:01:04 is quite rare in Asia and frequent in Africa (Fig. 4). The differentiation among populations, evaluated by F ST , became clearer when evaluated using the haplotypes/genomic alleles (Fig. 5) compared to SNPs (Fig. 3), with African, European, South Asian, and Southeast Asian population samples clustering together. However, most of the population samples amalgamate in the center of the plot, because the common HLA-G alleles are present worldwide (Fig. 5).
There are 74 alleles that have occurred at least twice. For instance, the new genomic allele named GEM-BIO_HLA-G_G128 presents a frequency of 30% among the Aeta and around 10% in Agta and the Bouganville Island samples (Table S3). The hunter-gatherers in the Philippines (Agta) have been linked to Bouganville and  www.nature.com/scientificreports/ Highland Papua New Guinea and Australian hunter-gatherers, possibly associated with a previous out of Africa 47 . The sequence GEMBIO_HLA-G_G31 occurred in many countries, achieving a frequency of 16.7% in the Central African Republic (Table S3). Both these new alleles encode the allotype G*01:01, which is the most frequent worldwide (Fig. 4). As explained earlier, these names represent sequences stored in the GeMBio laboratory Database (www. caste lli-lab. net/ HLA-G). Thus, HLA-G genetic diversity is higher than we currently acknowledge, although only a few different allotypes can be found worldwide. This observation reinforces the conservation in terms of protein diversity, but not necessarily in terms of DNA diversity. In fact, as demonstrated in the previous section, HLA-G nucleotide diversity is much higher than the one observed across chromosome 6, particularly in intronic regions. In addition, there is evidence of purifying selection acting on HLA-G exons, reinforcing the evidence of protein conservation in worldwide populations. We have characterized the pair of full sequences of every individual using the cutting-edge sequencing technology, and we have described here, for the first time, the complete sequences (DNA and protein) of some alleles that are only partially reported in the IPD-IMGT/HLA Database. Additionally, we have provided the sequences of many new HLA-G alleles. Sequence characterization was simplified because many individuals presented the same full-length sequence for alleles G*01:01:14, G*01:01:15, G*01:01:19, G*01:04:05, G*01:11, and G*01:14, which are only partially characterized in the IPD-IMGT/HLA Database. Likewise, some new sequences occurred in more than ten individuals. This worldwide data clearly reveals that the IPD-IMGT/HLA is outdated in terms of stored data for HLA-G. It should be emphasized that, as new NGS initiatives grow, invaluable high-quality and large-scaled HLA data will be shortly produced. Thus, it is of utmost importance that the IPD-IMGT/HLA considers these data to update the official database.
Some allotypes are particularly frequent in specific population samples (Table S4, Fig. 4). G*01:01 is the most frequent in all regions, except Oceania. G*01:06, for instance, is the product of a nucleotide exchange + 1799 C > T at codon 258 in exon 4, exchanging Threonine/Methionine (rs12722482, Table S1), and is frequent in the Middle East and South Asia, with intermediate frequencies in Europe and America, and almost absent in East Asia and Africa. Likewise, this allele was frequent in Cyprus 30 , absent in other Amerindian tribes from Brazil 45 and Benin 28 , and very rare in other Amerindians from South America 48 . G*01:06 was previously associated with a risk for allograft rejection 49 and pregnancy complications 31,50,51 , albeit the mechanisms underlying these associations are unknown. G*01:03, which results from an A > T exchange at position + 292 at exon 2, in codon 31, exchanging a Threonine for Serine (rs41551813, Table S1), is common among Africans, Americans (including Brazil), and in the Middle East, but rare among other biogeographic regions. Conversely, it is the most frequent allele among Karitiana (Table S4) and quite frequent in other Amerindian tribes from the Brazilian Amazon 45 . The frequency of G*01:04, which is a consequence of the nucleotide exchange + 755 C > A at exon 3, exchanging Leucine/Isoleucine (rs12722477 , Table S1), is particularly high in Asia and Oceania, especially among Japanese. It is also high in Africa, Agta, and many Amerindians from South America 28,45,48 and less frequent in Europe. G*01:05 N, a truncated HLA-G protein caused by a deletion at exon 3 leading to a premature stop codon further in intron 4 (rs41557518 , Table S1), is rare in most regions but frequent in Africa. European populations present the highest  www.nature.com/scientificreports/ frequency of allotype G*01:01, and usually the lowest frequency of other allotypes. Because of that, Europeans (especially Finns) present the lowest observed heterozygosity for allotypes across the world (data not shown). Some HLA-G allotypes only occurred in specific populations. For example, G*01:11 occurred among Caribbeans, African Americans, Puerto Ricans, and Brazilians, and G*01:14 among Caribbeans, Sierra Leoneans, and African Americans. Moreover, we detected at least 40 new allotypes. These new allotypes are usually rare and restricted to single populations. However, some have occurred more than once, such as the new allotype here named GEMBIO_HLA-G_P5 (P for protein sequence), which occurred in many samples from South East Asia and Oceania (Table S4), and it is a truncated protein sequence. The sequence of each allotype is available as a resource for download (Table 1).
In Fig. 6, we highlight all the important amino acid residues that may influence HLA-G function, and the polymorphic residues for allotypes that occurred at least twice in our dataset. Among these critical residues for HLA-G function, we may cite the Cysteine at the α1 domain (position 42 in the mature protein), which is responsible for dimer formation (C42-C42) among HLA-G membrane-bound and soluble isoforms, and dimers interact with more affinity with HLA-G receptors 4 . The Methionine and Glutamine (76 and 79 at the mature protein, in green in Fig. 6) residues, also at the α1 domain, interact with KIR2DL4 52 . The residues 195 and 197, at the α3 domain, interact with ILT-2 and ILT-4 leukocyte receptors (purple and yellow in Fig. 7, respectively) 53 . The motif DQTQDVE, also at the α3 domain, interacts with the TCD8 receptor (in blue in Fig. 6) 54 . As can be observed, there is no variability at residues associated with HLA-G major biological properties in a worldwide sample. Note that truncated proteins lacking the α3 domain (G*01:05 N, G*01:21 N, and some new allotypes) present limited interaction with HLA-G receptors. However, the global frequency of these truncated allotypes is low (2.6%), mostly represented by G*01:05 N.
The HLA-G 3' untranslated region. The 3' untranslated region (3'UTR) of most of the mRNA transcripts is the primary target for microRNAs (miRNA), which are small non-coding RNA molecules that posttranscriptionally modulate gene expression levels. MiRNAs modulate gene expression by two different mechanisms, mRNA degradation and translation blockage 55 . Variants in both miRNA coding gene and miRNA target sequence may influence miRNA/mRNA binding strength and thus influence gene expression levels 56 .
The HLA-G 3'UTR segment was fully characterized in 2010 by Castelli and colleagues, analyzing a large admixed Brazilian sample using Sanger sequencing 41 . There were eight frequent haplotypes in that survey, named UTR-01 to UTR-08, in order of frequency. Each of these 3'UTR haplotypes was in LD with specific HLA-G coding region alleles. The existence of these frequent haplotypes and their associated alleles were further confirmed by many other studies, using Sanger sequencing and NGS in different populations 24,27,28,30,40,[57][58][59][60][61][62][63][64][65][66][67] . There is a high LD between haplotypes from the HLA-G 3'UTR and the coding region (Fig. 1, Table 4) Table S5. However, as initially described in 2010, only eight of these haplotypes are globally frequent (Fig. 4), with frequencies varying in different biogeographic regions. UTR-01 and UTR-02 are common in Europe, Africa, America, and parts of Asia, and less common in Southeast Asia and Oceania. UTR-01 and UTR-02 are the most divergent ones, but they are in LD with the HLA-G*01:01:01:01 and G*01:01:02:01 coding alleles (encoding the same G*01:01 allotype), which in turn are also the most divergent coding region alleles. Conversely, the frequencies of UTR-03 and -07 follow the opposite path in terms of worldwide distribution, with the first associated with allotype G*01:04 and the last with G*01:01. UTR-07 is absent in Africa (Fig. 4).
The HLA-G 3'UTR presents at least three variants that have been evaluated in terms of functional studies. Undoubtedly, the most studied is the presence or absence of a 14-bp fragment at the beginning of the last HLA-G exon, commonly referred to as the 14 bp INS/DEL (rs371194629) polymorphism. The ancestral allele is the presence of this sequence, which is detectable in most primates. The supplementary alignment provides its status in each 3'UTR sequence. The DEL allele is associated with the frequent UTR-01, UTR-03, UTR-04, UTR-06, and UTR-18 haplotypes (Fig. 4), among others. The ancestral allele (INS, presence) was associated with lower HLA-G production of the membrane-bound and soluble isoforms [68][69][70][71] , while other studies pointed the opposite 72,73 . Previous studies indicated that the 14-base sequence (or another variant in linkage with it) triggers alternative splicing removing from the mature HLA-G mRNA the first 92 bases of the last exon, which also includes the 14b sequence 74 . These alternative mRNA seems to be more stable 75 , but the fraction of mRNA that undergoes this alternative splicing varies in different cell lines and tissues. Moreover, the 14-bp polymorphism may influence the binding of a few miRNAs that target the 14b sequence directly, in addition to various miRNAs that target the 92b segment that may be spliced out when the 14b sequence is present 76,77 . However, with the exception of miR-133a 78 , none of these interactions have been proven functionally.
The + 3142 G > C variant (rs1063320) influences the binding of some miRNAs, with Guanine allele favoring the targeting of three miRNAs, miR-148a-3p, miR-148b-3p, and miR-152 56,76 . However, while studies confirm that these miRNAs down-regulate HLA-G expression 79,80 , apparently, they down-regulate HLA-G irrespectively of the nucleotide at position + 3142 80 . Interestingly, miR-148a-3p and miR-148b also influence HLA-C expression, and both genes are co-expressed in the placenta during pregnancy 81 . The allele associated with lower miRNA www.nature.com/scientificreports/ binding, + 3142C, can be found worldwide and co-occurs with the frequent UTR-01, UTR-04, UTR-06, and UTR-18 haplotypes (Fig. 4, supplementary alignment). The + 3187 A > G variant (rs9380142) coincides with an AU-rich motif in the HLA-G 3'UTR that may trigger HLA-G mRNA degradation. The presence of Guanine in this position reduces the size of this AU-rich motif, Table 4. The relationship between HLA-G alleles and 3'UTR haplotypes, for combinations that have occurred at least twice in 4640 individuals across the globe. a The global frequency of this haplotype. b Considering all the 3'UTR haplotypes associated with a given genomic allele, this is the frequency in which this specific 3'UTR haplotype follows the given genomic allele. An internal frequency of 1.000 indicates that the given genomic allele always presents the same 3'UTR sequence. www.nature.com/scientificreports/ positively influencing mRNA stability and associated with higher HLA-G expression 82 . The only frequent 3'UTR carrying the high-expressing allele (Guanine) is UTR-01. Many studies addressed the influence of the 3'UTR sequence on the HLA-G expression levels of both membrane-bound and soluble isoforms. Since HLA-G presents a high LD (Fig. 1), any study addressing HLA-G expression profile and polymorphism must consider how these polymorphisms interact with each other as haplotypes. For instance, while UTR-04 and UTR-06 present the high-expressing + 3142 > C allele, they also carry the low-expressing + 3187 > A allele. The 14 bp INS allele usually co-occurs with the low expressing + 3142 > G and + 3187 > A alleles, with rare exceptions. The most divergent haplotypes, UTR-01 and UTR-02, differ in all the variants functionally reported to influence HLA-G expression and in two other positions (supplementary alignment). The coding alleles associated with these haplotypes encode the same HLA-G*01:01 allotype. It seems that the frequent 3'UTR haplotypes detected in modern humans somehow present polymorphisms that tune HLA-G expression in a dependent manner, possibly resulting in either high or low expression levels depending on the stimuli and the physiological microenvironment. The influence of specific haplotypes on the HLA-G expression levels has already been demonstrated functionally 24 .

Linkage disequilibrium between HLA-G and HLA-A.
HLA-A is a classical class I gene that encodes a key molecule for antigen presentation. HLA-A is the second most variable histocompatibility gene, with more than 6000 alleles already described in the IPD-IMGT/HLA Database. As antigen presentation is the main function of HLA-A, it is common sense that natural selection has shaped the genetic variability in its coding region, increasing variability at the peptide binding site, thus allowing a great diversity of antigen presentation 83,84 .
HLA-A and HLA-G are only 100 Kb apart from each other. This proximity results in high LD and hitchhiking effects might even involve common regulatory regions. Linkage disequilibrium between HLA-G and HLA-A has been previously demonstrated using low HLA-A typing resolution 43 . Here, we demonstrate the LD level between these two genes by using forefront technology. First, we processed the region encompassing HLA-G and HLA-A using the same pipeline, as described in the supplementary methods. Then, we removed variants laying on regions with known structural variants and rare variants (MAF < 1%), resulting in 1873 frequent SNPs covering the HLA-G, HLA-A, and their intergenic region. With this data, we calculated the r 2 between pairs of bi-allelic SNPs (Fig. 7). Balancing selection has been well documented for both HLA-A 83,85 and HLA-G 86 , but at different fashions. While the most polymorphic segment at HLA-A is the coding sequence, associated with the presence of frequent and divergent coding alleles in worldwide populations, the most polymorphic sites at HLA-G are the regulatory regions (i.e., promoter, introns, and the 3'UTR, Fig. 2). Many studies have detected signatures of balancing selection at the HLA-G promoter and 3'UTR 40,66,86,87 , while the coding segment is conserved, as we demonstrated here, being under purifying selection. However, it is not clear whether balancing selection is indeed operating at the HLA-G promoter (for instance, in the region encompassing polymorphism − 725), or whether these findings are due to a hitchhiking effect caused by the selective pressures acting the neighboring HLA-A gene.
Since HLA-A is a crucial molecule for triggering adaptive immune responses, HLA-A is usually a target for natural selection. Balancing selection is an important force that maintains advantageous genetic diversity in populations, including variations responsible for long-term adaptation to the environment 88 . For the coding region, it is well established that balancing selection deals with an excess of non-synonymous substitutions and an increase of functionally relevant variability.
A previous study from our group has proposed that the HLA-A locus is not influencing HLA-G, and therefore HLA-G is a direct target of selection 87 . However, the study above addressed this issue using the 1000Genomes phase 1 data, which was the only dataset available at the time. Since then, several studies have revealed that the 1000genomes phase I dataset at the HLA region is strongly affected by mapping bias. This mapping bias, together www.nature.com/scientificreports/ with the fact that those samples were sequenced in a low coverage fashion, and no specific tools for HLA were applied, might have led to a wrong conclusion that HLA-A does not influence HLA-G.
Here, we demonstrate a strong LD between HLA-G and HLA-A in a way that the most frequent HLA-G alleles are associated with specific HLA-A alleles. Because there is a strong LD between any HLA-G allele and specific promoter and 3'UTR sequences, balancing selection operating at the HLA-A locus may also shapes the HLA-G promoter frequencies, downplaying its direct role on HLA-G. The same rationale can be applied to the HLA-G intronic sequences. High heterozygosity in HLA-A, which is commonly observed in worldwide populations, would lead to high heterozygosity in the HLA-G coding region (at the DNA level) and, thus, to high heterozygosity in the HLA-G promoter and intronic regions 30,40 . We cannot disregard the possibility that part of the genetic signature of balancing selection on HLA-A is due to HLA-G since the signatures of Natural Selection observed in both genes are not independent, as previous stated.
Implication of the HLA-G variability in worldwide populations. After fully sequencing HLA-G and its promoter region in 4640 individuals from 88 different populations samples, we have concluded that this gene is highly conserved in the protein level, with no variation in the residues responsible for the biological function of the protein, i.e., dimer formation, alternative splicing, peptide coupling, and interaction with HLA-G receptors. Considering that HLA-G dimer interaction with HLA-G receptors is more stable than with monomers 17,53 , and considering that dimer formation primarily depends on the disulfide bridge formed between Cysteine residues (C42-C42) 53 , as seen in Fig. 6, all complete proteins detected in worldwide samples exhibit Cysteine at residue 42. Then, theoretically, all proteins and the most common HLA-G isoforms exhibiting the α1 domain may form dimers. Since the mechanisms of HLA-G alternative splicing have not been completely defined, the frequent variants in the HLA-G introns may influence splicing, yielding different isoforms with different properties. In this context, some of these isoforms may not present the α1 domain 17 , avoiding dimer formation. Most of the HLA-G coding region variability coincides with introns ( Fig. 2 and Table 2), and it is possible that these variants influence HLA-G regulation. All proteins we have detected exhibit no variability at residue motifs described to interact with ILT-2 (Phe195), ILT-4 (Phe195, Tyr197), KIR2DL4 (Met76, Gln79) and CD8 (DQTQDVE, residues 223-229) 2,54,89 . Nevertheless, isoforms presenting an incomplete α1 domain cannot interact with KIR2DL4, particularly the truncated HLA-G proteins encoded by the HLA-G*01:05 N, *01:13 N, and 01:21 N alleles (Fig. 6), as well as with ILTs and CD8 receptors.
Because HLA-G is a promiscuous immune checkpoint molecule, the modulation of its function may be tuned according to specific therapeutical approaches. The use of monoclonal antibodies against HLA-G in situations in which its expression is undesirable (for instance, in cancers and chronic viral disorders) would have many side effects since HLA-G is constitutively expressed in thymus, pancreas, and hematopoietic stem cells 17 . On the other hand, the use of recombinant HLA-G would be a plausible alternative for allograft and autoimmune disease treatment. Since few coding allotypes (proteins) are most frequently observed in worldwide populations, the production of a small number of recombinant HLA-G proteins would be sufficient to individualize treatment, according to the patient HLA-G coding region typing. Besides targeting the molecule, it is also possible to modulate its expression taking advantage of the knowledge of the target sites for transcription and post-transcription factors that act on the HLA-G gene expression. Considering the canonical and non-canonical miRNA and transcription factor 22,24,56,76,[90][91][92][93] targeting HLA-G, a comprehensive survey of the HLA-G genetic diversity as provided in this study may help to unveil regulatory factors and strategies to differentially modulate HLA-G production whenever necessary.
Taken together, the results of this investigation indicate that: (1) most of the HLA-G polymorphic sites are located in the regulatory and intronic regions, (2) although some promoter segments present the highest nucleotide diversity and Tajima's D, the strongest signals of balancing selection come from intronic segments, particularly intron 5 (3) a high LD is observed across HLA-G, from 4 Kb upstream the gene up to the HLA-A locus, (4) it is possible that balancing selection acting on HLA-A may be influencing the high heterozygosity observed in the regulatory and intronic sequences of HLA-G, and vice-versa, (5) the presence of few proteins frequently observed in worldwide populations corroborates the role of the HLA-G as an immune checkpoint molecule rather than as an antigen-presenting molecule; (6) the lack of variation in residues responsible for major HLA-G biological functions reinforces the immunomodulatory properties of the molecule, maintained throughout evolution; (7) most common variation sites within HLA-G regulatory and coding region have been maintained as haplotypes, as observed in worldwide populations, (8) motifs associated with HLA-G interaction with receptors may be engineered modified to impede or enhance receptor interaction according to the underlying disorder, (9) possible therapeutic use of HLA-G as an immune checkpoint agent may be accomplished with the production of a limited number of recombinant proteins.

Concluding remarks
This is the most comprehensive study of the genetic diversity involving HLA-G performed so far. This large dataset, composed of populations from different ancestries and biogeographical regions and using sequencing data produced with cutting-edge sequencing technology and bioinformatics pipelines, may provide a broad picture of worldwide patterns of HLA-G haplotype distribution and linkage disequilibrium. This landmark achievement on HLA-G genetic diversity that will hardly be overridden in the near future provides all the basis for in-depth association and functional studies, which is of utmost importance given the immunomodulatory properties of the molecule.