Human leukocyte antigen super-locus: nexus of genomic supergenes, SNPs, indels, transcripts, and haplotypes

The human Major Histocompatibility Complex (MHC) or Human Leukocyte Antigen (HLA) super-locus is a highly polymorphic genomic region that encodes more than 140 coding genes including the transplantation and immune regulatory molecules. It receives special attention for genetic investigation because of its important role in the regulation of innate and adaptive immune responses and its strong association with numerous infectious and/or autoimmune diseases. In recent years, MHC genotyping and haplotyping using Sanger sequencing and next-generation sequencing (NGS) methods have produced many hundreds of genomic sequences of the HLA super-locus for comparative studies of the genetic architecture and diversity between the same and different haplotypes. In this special issue on ‘The Current Landscape of HLA Genomics and Genetics’, we provide a short review of some of the recent analytical developments used to investigate the SNP polymorphisms, structural variants (indels), transcription and haplotypes of the HLA super-locus. This review highlights the importance of using reference cell-lines, population studies, and NGS methods to improve and update our understanding of the mechanisms, architectural structures and combinations of human MHC genomic alleles (SNPs and indels) that better define and characterise haplotypes and their association with various phenotypes and diseases.


INTRODUCTION
The human Major Histocompatibility Complex (MHC) on the short arm of chromosome 6 (band p21.3) is a Human Leukocyte Antigen (HLA) super-locus composed of clusters of many tightly linked supergenes involved with various phenotypic functions, mostly in connection with the immune response [1][2][3][4] . The MHC genes are defined as supergenes on the basis that they are clusters of tightly linked functional genetic elements spanning hundreds of kilobases that control complex balanced phenotypes and are inherited as a unit [haplotype] owing to reduced or absent recombination within them 5 , and because many have evolved by genomic duplications, deletions and inversions 6 . Although the most common mechanism of supergene formation is considered to be by inversion 7,8 , in which single crossovers between heterozygotes may lead to unbalanced gametes, the MHC genomic organisation reveals a variety of haplotypes with segmental duplications [9][10][11] , and structurally variant loci such as C4 and DRB 12 , and a variety of duplicated repeat elements 6,13,14 , that exist possibly due to balancing selection 15,16 . These duplicated and inverted homologues probably generate recombinant haplotypes by varying rates of non-allelic and allelic homologous and nonhomologous recombinations and crossovers 12,17 . Thus, finding reliable phenotypic associations by genome-wide association studies (GWAS) is complicated and masked by the presence of hundreds of interlinked genes and regulatory elements in strong linkage disequilibrium (LD) within the super-locus [18][19][20] .
The HLA super-locus is characterised specifically by twelve classical class I and class II genes that encode antigen-presenting HLA proteins that present host (self) or foreign (nonself) peptides to interact with T-cell receptors in order to discriminate between self and nonself as part of the host immune response 3,[20][21][22][23] . This is an important immunogenetic regulatory region 24 of ~4 Mb in length with more than 120 non-HLA genes that together with the classical and non-classical HLA genes have been associated with more diseases than probably any other region of the human genome 1,2,12,25 . It is one of the most complex and diverse genomic regions with high levels of polymorphism, gene duplications, repeat elements, structural variations (indels), and long-range haplotype segments or blocks known as Conserved Extended Haplotypes (CEHs) 18 or Ancestral Haplotypes (AHs) 10 . The diversity of the variable long-range haplotype segments within heterozygous individuals has posed challenges for assigning SNPs to loci and for assembling structural variants of numerous duplicated genes, particularly in regard to associating them as genetic markers or causative agents for many of the immune-related phenotypes and diseases 18 . In recent years, more attention has been given to gaining a better understanding of MHC haplotypes by phased long-range sequencing as an extension of genotyping, and to identifying genic and non-genic alleles for associating them with disease and bone marrow transplantation, and for ascertaining the effects of immunotherapy 26 . Reliable MHC linkage mapping and haplotyping usually depend on pedigree studies of particular genotyped markers to evaluate their linkage or segregation in meiosis 18 or on phased genomic sequences 26 , such as those that have been sequenced or genotyped using multilocus HLA-captured haplotype phasing 27,28 , de novo assembled trios 29 , MHC homozygous cell-lines 11 , sperm 30 or single chromosomes 31 .
Because of the complexity of the MHC as a HLA super-locus with a myriad of interconnected gene systems and sub-genomic regions, it is a gradual and continuing difficult process to build up the genetic, molecular and functional knowledge about the architectural and functional organisation of haplotypes in this region and their overall contribution to health and disease 1,2,25,26,32,33 .
J.K. Kulski et al.
In this brief review, we outline some of the recent analytical developments used to investigate the SNP polymorphisms, structural variants (indels), expression quantitative trait locus (eQTL) and haplotypes of the HLA super-locus. We highlight the importance of using reference cell-lines, population studies and next-generation sequencing (NGS) methods to overcome past problems and to improve and update our understanding of the mechanisms and architectural structures and combinations of human MHC genomic alleles (SNPs) that better define and characterise haplotypes, and their association with various phenotypes and diseases.

MHC genomic sequence and subdivisions of structural organisation
The first fully sequenced and gene annotated human genomic MHC was published in 1999 using the pioneering Sanger sequencing technology 34 . This primary sequence was a 'virtual MHC' composed of a mosaic of different human haplotypes rather than presenting any one particular haplotype. Subsequently, the first generation genomic sequences of eight human ancestral MHC haplotypes were published for a more precise comparative genomic analysis of the similarities and differences between different haplotypes 35 . Figure 1 shows the gene map of the HLA genomic region based on Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14) in the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/genome/?term=human) and the MHC-PGF haplotype, one of the eight MHC haplotypes sequenced by the MHC Haplotype Consortium (Fig. 1A) 35 . The MHC genomic organisation has a high degree of evolutionary complexity with the remnants of many homologous segmental duplications 6 as well as inversions (Fig. 1B); probably turned over and shuffled by many different ancestral hominoid haplotypes as a result of non-allelic and allelic homologous recombination, gene conversion (nonhomologous recombination) and sequence crossover between different homozygotes or heterozygotes (Fig. 1C).
The HLA super-locus is divided into three regions related to the functions and distributions of the duplicated HLA genes and pseudogenes: the class I region located at the telomeric end and the class II region at the centromeric end, both separated from each other by an extended class III region of 61 protein-coding genes 1,2 . Whereas the HLA class I and class II genomic regions encode the highly polymorphic gene complex of the HLA class I and HLA class II genes, the class III region consists of many different non-HLA genes that are involved in stress response (HSPA1A, HSPA1B and HSPA1L), complement cascade (C4A, C4B, C2, CFB), immune regulation (NFKBIL1, FKBPL and DDX39B), inflammation (LTA, LTB, LST1, ABCF1, AIF1, NCR3 and TNF), leukocyte maturation (LY6G5B, LY6G5C, LY6G6D, LY6G6E and LY6G6C), and regulation of T cell development and differentiation (BTNL2) 4,36 . Recently, Zhou et al. showed that a quartet of MHC class III genes (NELF-E, SKIV2L, DXO and STK19) are involved with the metabolism and surveillance of RNA during the transcriptional and translational processes of gene expression 37 . The class II region also contains some proteasome-processing and peptide antigen transportation non-HLA genes such as PSMB8, PSMB9, TAP1, and TAP2. The TAP-binding protein, TAPBP, is in the extended class II region. The 'Class I' region (telomeric to centromeric ends) ranges from HLA-F to MICB, 'Class III' from PPIAP9 to BTNL2, and 'Class II' from HLA-DRA to HLA-DPA3. There also are sub-regions from the telomeric side of Class I and the centromeric side of Class II that are called the 'Extended class I' (telomeric side of HCG4P11) and 'Extended class II' (centromeric side of COL11A2) regions, respectively. The class I region has been divided into three genomic blocks, alpha, beta and kappa 6,10,38 , that include duplicated HLA genes on either side of two intervening blocks of framework (FW1 and FW2) genes (Fig. 1A) that include non-HLA genes 39 . HLA-A, -G and -F are in the alpha block, HLA-B and -C are in the beta block, and HLA-E is in the kappa block.
A total of 283 loci were identified and/or reclassified in the 3.78-Mb HLA genomic region of the PGF haplotype from GABBR1 located on the extended class I region to KIFC1 located on the extended class II region ( Fig. 1A and Table 1). When all the loci of the HLA genomic region are grouped into four categories of gene types, then 144 loci are classified as a protein-coding gene, 53 loci are non-coding RNA (ncRNA), five loci are small nucleolar RNA (snoRNA) and 81 loci are pseudogenes (Table 1). Of the 283 loci, 15.5% (44 loci) are occupied by HLA and HLA-like genes (HLA class I, HLA class II and MHC class I polypeptide-related sequences or MIC genes). However, the genic and non-genic numbers in Table 1 are not absolute for the MHC genomic region because of haplotype differences that may involve structural variations due to duplications, deletions, and insertions.
Of the HLA and HLA-like genes, 18 HLA class I genes (six protein-coding genes and 12 pseudogenes) (Fig. 1B) and 7 MIC genes (two protein-coding genes and five pseudogenes) are located in the HLA class I region, and 18 HLA class II genes (13 protein-coding genes and five pseudogenes) are in the HLA class II region ( Fig. 1A and Table 2). Also, one HLA class I 88-bp pseudogene (HLA-Z) is located within the ncRNA gene LOC100294145 close to the HLA-DMB gene in the HLA class II region. The classical HLA class I genes, HLA-A, -B and -C, and the classical HLA class II genes, HLA-DR, -DQ and -DP, are characterised by their extraordinary polymorphisms, whereas the non-classical HLA class I genes, HLA-E, -F and -G, are differentiated by their tissue-specific expression and limited polymorphism ( Table 2).
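The locus counts quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that only re-adds the numbers stated in the text for the PGF haplotype; the dictionary names and category labels are illustrative and are not taken from Table 1 or Table 2.

```python
# Locus counts quoted in the text for the PGF haplotype (GABBR1 to KIFC1).
locus_counts = {
    "protein_coding": 144,
    "ncRNA": 53,
    "snoRNA": 5,
    "pseudogene": 81,
}

# HLA and HLA-like loci as (protein-coding, pseudogene) counts quoted in the text.
# The labels are illustrative only.
hla_like = {
    "HLA class I (class I region)": (6, 12),
    "MIC": (2, 5),
    "HLA class II": (13, 5),
    "HLA-Z (class I pseudogene in class II region)": (0, 1),
}

total_loci = sum(locus_counts.values())
total_hla_like = sum(coding + pseudo for coding, pseudo in hla_like.values())
hla_like_percent = round(100 * total_hla_like / total_loci, 1)

print(total_loci, total_hla_like, hla_like_percent)
```

The four gene-type categories sum to the 283 loci, and the 18 + 7 + 18 + 1 HLA and HLA-like loci give the 44 loci (15.5%) stated in the text.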
Fig. 1 Human MHC genomic map, HLA gene duplications and haplotypic crossovers during meiosis. A Gene map of the HLA genomic region that corresponds to the genomic coordinates of 29602228 (GABBR1) to 33410226 (KIFC1) on chromosome 6 in the human genome GRCh38.p13 primary assembly of the NCBI map viewer. The regions separated by arrows show the HLA sub-regions such as extended class I, class I, class III, classical class II and extended class II regions from telomere (left and top side) to centromere (right and bottom side). The red and blue double, horizontal arrows show the spans of alpha, kappa and beta blocks, and framework (FW), non-HLA gene blocks, FW1 and FW2, respectively. The Class III region is composed of non-HLA genes or FW genes, but is known traditionally as Class III. White or coloured (orange, red and blue) boxes, grey, and black boxes show protein-coding genes, non-coding RNAs (ncRNAs), and pseudogenes, respectively. Red, green and blue letters indicate HLA class I, MIC, and class II genes, respectively. Adapted from Shiina et al. 1,2 . B Cis and trans structural orientation of duplicated and inverted HLA class I and class II genes within their duplication blocks (alpha, kappa, beta, delta and epsilon) relative to the telomeric and centromeric ends (left to right, respectively) of the HLA super-locus. The duplicated MIC pseudogenes, MICE, MICG, MICF, MICD, MICC, and their locations in the alpha block (A) are not shown in B. All MIC pseudogenes and genes in the MHC genomic region are coded in the opposite direction to all the HLA class I genes and pseudogenes 6 . Solid arrows indicate the 5′ to 3′ direction of coding genes and dotted arrows indicate the 5′ to 3′ direction of pseudogenes (italicised). The structural variants for the HLA-DRB3, DRB4 and DRB5 genes in the class II region and the C4 genes in the class III region are indicated by the enclosed vertical boxes. The location and distribution of the duplicated genes are not shown to exact genomic scale. C Chromosomal or SNP density crossover (XO) junction in comparisons between two homozygous haplotype pairs (AB/AB and ab/ab) and two heterozygous recombinant (haplotype) pairs (AB/Ab and aB/ab). Chromosomal recombination is shown with a XO located between loci A and B within haplotype region 'AB' and between loci a and b within haplotype region 'ab' in a diploid cell during meiosis.
Apart from the protein-coding genes, pseudogenes, non-coding transcribed RNA loci, and small nucleolar RNA (snoRNA) loci, there are at least 8604 repeat elements, including those known as transposable elements (TEs) and/or retroelements, and 723 simple repeats (microsatellites) in the MHC PGF haplotype sequence. Table 3 lists the main families of repeat elements identified and classified by RepeatMasker (http://www.repeatmasker.org) as a percentage of genomic sequence both within the intervening sub-regions, and within the entire MHC region from HLA-F to HLA-DPA3. The SINEs that congregated mainly in FW2 (26%) and class III (21%) regions were lowest in the alpha, kappa, beta, and class II blocks at <10%. The LINEs, mostly fragmented and of the mammalian L1M types, were found at highest percentage in the kappa block (31%), and within the beta block, FW1, and class II region, each at 26%. The ERVL subfamily of the LTR family was present in the alpha and beta blocks at a percentage at least three to ten times higher than within the other subregions. The LTR and ERVL were highest in the alpha block (25% and 13%, respectively) and lowest in the class III region (4% and 0.3%, respectively). Many of the LTR/HERVs form the building blocks of the transcriptional regulatory elements 40 , and their relatively high content in the alpha and beta blocks (Table 3) may reflect a role in the duplication of the HLA genes within the MHC 6,41-44 .
The overall total percentage of the interspersed repeat elements (IREs) was highest in the beta (61%) and alpha (58%) blocks and lowest in the class III region (41%). On the other hand, the class III region and FW2 had the highest GC level percentage at 49% and 48%, respectively, possibly reflecting the greater density of coding genes within these two regions.
Homozygous cell-lines as MHC genomic sequence haplotype references
Haplotypes at the genomic sequence level are blocks of phased coding and non-coding nucleotide sequences of multiple loci that are in the same orientation (cis) as their mode of gene transcription and regulation 26 . The characterisation and understanding of MHC haplotypes in modern disease and population genetics began in 1967 with the introduction of the word 'haplotype' by Ruggero Ceppellini to describe alleles in the HLA system 45 , and expanded in the 1990s with the pedigree studies of the research groups of Alper 9,18 , and Dawkins 10,46,47 . Since then, the International Histocompatibility Workshop Group (IHWG) has provided at least a thousand commercially available cell-line samples from HLA heterozygous and homozygous donors, families, and diverse populations (https://www.fredhutch.org/en/research/institutes-networks-ircs/international-histocompatibility-working-group.html) that are important for research into MHC immunogenetics, comparative genomics, transcriptomics and haplomics 11,18,28,35,46,47 . These genotyped or fully sequenced MHC haplotypes provide standardised references to assist with the design and interpretation of HLA genotyped population studies and HLA-disease relationships. The genotyped cell-lines also provide excellent insights into the structural organisation of MHC phased haplotypes 11 , not previously available for detailed comparative analysis by just using blood or tissue samples collected from diploid heterozygous individuals. The first MHC genomic sequence variations in different haplotypes were produced by the Sanger Centre MHC Haplotype Project (SCMHP) using eight homozygous cell-lines 35 . These now are alternative reference sequences as part of the human reference genome GRCh38 48 .
Initially, only two haplotypes were resolved completely at the base pair level (cell-lines PGF and COX), whereas the other six haplotypes were completed for only 51% (cell-line APD) to 93% (cell-line QBL) of the MHC genomic region. Seven of the SCMHP cell-lines were resequenced as part of 95 near-complete haplotypes, using short-range and long-range NGS 11,49 . Overall, Norman et al. provided 137 genotyped loci for most of the 95 cell-lines that they sequenced 11 . Table 4 shows the diversity of 68 different haplotypes at six HLA class I and class II loci for eight cell-lines sequenced by the SCMHP, [12][13][14]50 . The AH/CEH nomenclature in Table 4 is taken from Dorak et al. 47 . The AH names use the B allele, and if two or more AHs carry the same B allele then sequential numbers are added to indicate the order of discovery, such as AH7.1 and AH7.2 47 . In Table 4, four different cell-lines (PGF, SCHU, HO104, LD2B) 11 47 are different from each other at all the six loci except at HLA-C where they are both C*07; although they actually are different from each other at the two digital allelic level, C*07:02 and C*07:01, respectively. This two digital allelic difference represents the two amino acid difference between the HLA-C proteins for AH7.1 (PGF) and AH8.1 (COX) with K90N in exon 2 and S125Y in exon 3. Comparatively, most of the 68 haplotypes in the Norman et al. 11 study are hybrids or recombinants that are different at one or more loci, but share the same alleles possibly at other loci. For example, the ten haplotypes with the allele A*01:01:01:01 at the HLA-A locus are different at one or more of the other five loci. However, some of these A*01 haplotypes have the same alleles at other loci. There 13,14 suggest that the haplotypic block or segmental SNP patterns with genomic sequence crossovers (Fig. 2) probably evolved ancestrally using recombination mechanisms 17 . Conserved and hybrid haplotypes are likely to have accumulated in interrelated populations or ethnic groups in relatively recent times, possibly over a few thousand generations or more 52 . These shuffling or recombination mechanisms are delineated also as SNP diversity plots in sequence alignments between two phased MHC genomic regions (Fig. 2).
FW1 and FW2 indicate framework gene (non-HLA genes) segment 1 and segment 2, respectively, within the MHC class I region located between the alpha and beta blocks (Fig. 1A).
The haplotypes in (A) and (B) were sorted according to the HLA-A allele in descending order. The AH nomenclature is taken from Dorak et al. 47 , which is based on the initial definitions by Dawkins et al. 10 and Alper et al. 9,18 , whereby the AHs are also called CEHs. The AHs are named using the B allele, and if two or more AHs carry the same B allele then sequential numbers are added to indicate their order of discovery, such as AH7.1 and AH7.2. The 'x' after the B allele implies that the sequential number is not known, and therefore needs to be updated. A blank space in the AH column indicates that the AH designation is not known or updated in the literature. Norman et al. 11 have provided the names of the cell-lines for each of the haplotypes sequenced, but we have not added them to this
Haplotype SNP diversity plots and crossover junctions. Figure 2 shows SNP diversity plots in nucleotide DNA comparisons between the same and different human MHC haplotypes as well as with a chimpanzee haplotype sequence. SNPs are the nucleotide sequence differences seen between two different phased haplotypes that have been aligned (Fig. 2A, E, F). Sequence alignments between different haplotypes (heterozygous sequences) reveal varying SNP densities (number of SNPs per kb) across the entire MHC with the greatest SNP densities occurring in the alpha block within the HLA-A gene region; the HLA-B and -C genes of the beta block; the delta block with HLA-DRB1, -DQA1 and -DQB1; and the epsilon block involving HLA-DPB1. Unsurprisingly, the highest SNP density peaks occur in the regions of the HLA classical class I and class II genes that correlate positively with the overall number of alleles detected for the different HLA gene loci (Table 2). In comparison, the SNP densities are consistently at low levels in the non-HLA genetic regions such as those between the alpha and beta blocks in the class I region, and in the class III region where the number of alleles for each of the class III genes is often <20, and comparable to the allele numbers detected for non-classical HLA genes, like HLA-F, and HLA pseudogenes (Table 2). Fewer SNPs are detected between two aligned homologous or highly similar sequences (e.g., Fig. 2B, PGF versus LD2B) than between different haplotypes (e.g., Fig. 2A, PGF v COX) because they are identical by descent with no recombination. However, some nucleotide differences either as de novo mutations and/or sequencing or assembly errors are evident across the alignment between fully matched HLA loci (conserved haplotypes). In contrast, sequence alignments of recombinant haplotypes (e.g., Fig. 2C-E) reveal an extended sequence block that is rich in SNPs adjoining an extended block of homologous sequences with no or few SNPs (labelled as SNP poor or SP) that are seen to be SNP rich in other haplotype comparisons (Fig. 2A). The junctions between the SNP rich and SNP poor blocks are the SNP crossover junctions, suggesting that they are in close proximity to chromosomal recombination crossover regions 13,14 , as outlined in Fig. 1C. With recombinations and crossovers, a considerable amount of opportunistic hitchhiking may occur particularly near the HLA loci 53 , and with the integration and rearrangement of Alu, LTR and HERV elements 54 .
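The idea of a SNP density plot and a SNP crossover junction can be illustrated with a short sketch. It assumes two phased haplotype sequences that have already been aligned to the same length, counts mismatches per window, and labels windows as SNP poor or SNP rich using the <1 SNP/kb cut-off given in the Fig. 2 legend. The function names, the toy sequences and the 10-bp window are hypothetical choices for illustration; this is not the pipeline used to produce Fig. 2, and the legend's requirement for extended regions (>50 kb) is not modelled here.

```python
from typing import List


def snp_density(hap_a: str, hap_b: str, window: int = 1000) -> List[float]:
    """SNPs per kb in consecutive windows of two pre-aligned, equal-length sequences."""
    if len(hap_a) != len(hap_b):
        raise ValueError("sequences must be aligned to the same length")
    densities = []
    for start in range(0, len(hap_a), window):
        a = hap_a[start:start + window].upper()
        b = hap_b[start:start + window].upper()
        # Compare only columns where both bases are called (skips gaps and N runs).
        compared = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
        snps = sum(1 for x, y in compared if x != y)
        # A window with no comparable bases is reported as 0.0 (no evidence of SNPs).
        densities.append(1000 * snps / len(compared) if compared else 0.0)
    return densities


def label_windows(densities: List[float], threshold: float = 1.0) -> List[str]:
    """Label each window 'poor' (< threshold SNP/kb) or 'rich'."""
    return ["poor" if d < threshold else "rich" for d in densities]


def junctions(labels: List[str]) -> List[int]:
    """Indices of windows where the label changes relative to the previous window."""
    return [i + 1 for i in range(len(labels) - 1) if labels[i] != labels[i + 1]]


# Toy example: identical first half, divergent second half (10-bp windows).
toy_a = "A" * 40
toy_b = "A" * 20 + "ACGTACGTAC" * 2
toy_density = snp_density(toy_a, toy_b, window=10)
toy_labels = label_windows(toy_density)
toy_junctions = junctions(toy_labels)
print(toy_density, toy_labels, toy_junctions)
```

In the toy example the label switches from SNP poor to SNP rich at the third window, which is the analogue of a putative crossover junction. Real comparisons would need a proper alignment step and handling of structural variants, which this sketch leaves out.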
Supergene expression, eQTL, epistasis and disease. Since our earlier analyses of MHC gene variants, epistatic interactions, expression activity and associations with various diseases, which were taken from publications and records in public databases such as the Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM) and the Genetic Association Database (GAD) 1,2 , these types of genome-wide MHC association studies have progressed much further with the more formidable bioinformatic analyses of phenotype associations, known as MHC PheWAS 55 . However, regulatory elements can act over long distances and in a cell-type specific manner, which hampers the easy identification of the causal genes for a given pathological condition 56,57 . In this regard, haplotyped homozygous cell-lines also can be used to study gene interactions or epistasis both inside and outside the MHC genomic region 16,58,59 . Expression quantitative trait locus (eQTL) studies associate genomic and transcriptomic data sets from the same individuals to identify loci that affect mRNA expression by linking SNPs to changes in gene expression 58 . Thus, eQTL analysis can be a useful procedure for annotating GWAS variants.
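As a rough illustration of what linking a SNP to a change in gene expression means, the sketch below regresses expression values on genotype dosage (0, 1 or 2 copies of an allele) for one SNP and one gene. It is a simplified stand-in for a single eQTL test, not the method of any study cited here; real analyses add normalisation, covariates and multiple-testing correction. The function name and the toy data are hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple


def eqtl_slope(dosages: Sequence[float], expression: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and Pearson r of expression on allele dosage."""
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched dosage and expression values for >= 3 samples")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in dosage or expression")
    slope = sxy / sxx            # change in expression per extra allele copy
    r = sxy / sqrt(sxx * syy)    # strength of the linear association
    return slope, r


# Toy data: six samples, expression rising by about one unit per allele copy.
toy_dosage = [0, 0, 1, 1, 2, 2]
toy_expr = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
toy_slope, toy_r = eqtl_slope(toy_dosage, toy_expr)
print(round(toy_slope, 3), round(toy_r, 3))
```

A haplotype-based test follows the same pattern with the dosage of a haplotype (0, 1 or 2 copies) in place of a single SNP allele.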
A number of recent studies using homozygous cell-lines and/or biological samples have demonstrated that the expression of various clusters of genes inside or outside the MHC genomic region can be affected by the expression of one or more haplotypic genes within the MHC genomic region 58-61 . Lam et al. used eight homozygous cell-lines, six with Chinese haplotypes (A*33:03-C*03:02-B*58:01-DRB1*03:01 or A*02:07-C*01:02-B*46:01-DRB1*09:01), and two with European haplotypes (A*01:01-C*07:01-B*08:01-DRB1*03:01) 58 . They used haplotypic RNA and DNA-sequencing data to show that haplotype sequence variations represented by eQTL SNP alleles can function as cis-acting regulatory variants for multiple MHC genes. The enriched haplotype-specific transcriptional eQTLs were localised especially within four segmental regions containing HLA-A (alpha block), HLA-C (beta block), C4A (gamma block) and HLA-DRB (delta block). Thirty-six MHC genes from extended MHC and classes I, II and III showed significantly differential expression between the three MHC haplotypes.
In F, the SNV plot is between the PGF reference sequence (CZUC02000001.1) and the chimpanzee (CHIMP) genomic reference sequence, GCF_002880755.1. The SNV regions labelled 'del' are genomic sequence regions absent from CHIMP sequence. The Chimpanzee MIC gene in the beta block is a hybrid of human MICA and MICB 53,62 . The Y-axis presents the number of SNP/kb (window size). The X-axis shows the SNP density positions (SNP/kb) across 3.6 Mb of genomic sequence between the MOG and COL11A2 genes. The red vertical lines along the X-axis that are above 100 on the Y axis are artifactual sequences or those representing sequence gaps, poor assembly, inversions or long runs of unspecified nucleotides. The yellow horizontal boxes labelled SNP POOR are regions of recombination (highly conserved nucleotide sequence with little or no SNPs between sequence alignments). In this context, the SNP POOR regions are those that are <1 SNP/kb, in contrast to the same regions that are SNP rich (>1 SNP/kb) in other haplotype sequence comparisons. The MHC class III and most FW genes in the class I region are always SNP poor, and consequently were not labelled as such in A, D or E. The ends of the 'SNP POOR' boxes represent regions of putative crossovers (vertical arrows) between SNP poor and SNP rich regions of different haplotypes in C-E. In B, PGF v LD2B shows the relative absence of SNPs across 3.6 Mb between two conserved (highly similar sequences) haplotypes. Extended genomic regions (>50 kb) with 1-50 SNP/kb are considered to be SNP rich regions, whereas extended regions of <1 SNP/kb are SNP poor regions. The SNPs within SNP poor regions were easy to count manually because of small numbers (<0.1 SNP/kb), whereas SNP rich regions were difficult to count because of larger numbers at an average of 7 SNP/kb in the alpha block (320 kb), and up to 50 SNP/kb or greater in the delta block (185 kb) depending on haplotype comparisons. In the alpha block, the highest average SNP density between seven different haplotypes was 16 SNP/kb near HLA-A with the lowest density at ~2 SNP/kb near the HLA-J pseudogene 13 . The SNP count was <0.001 SNP/kb between the same HLA-A haplotypes in C and D. Spikes and peaks of SNPs above 100 SNP/kb were due mostly to nucleotide misalignments because of poor sequence assembly, structural variations, gaps, inversions or long runs of unspecified nucleotides. See recent SNP plots by Houwaart et al. 49 for additional comparisons between MHC haplotypes.
Lamontagne et al. used hundreds of lung tissue samples collected from patients in Canada and the Netherlands to show that gene expression within the extended MHC region and class I, II and III regions correlated with lung disease/trait specific local- and distant-acting eQTL SNPs 60 . By using eQTL analysis of a large human cohort with both RNA-sequencing and genotyping data available for HLA alleles in peripheral blood, Sharon et al. found strong trans-regulatory associations between the HLA-DR, HLA-DQ, or HLA-DP β chains and the T cell receptor (TCR) α chains 61 . Their results suggest that MHC genotypes have a key role in shaping the TCR repertoire by determining the V gene usage profiles of an individual's TCR repertoire. In a recent in-depth interrogation of associations between genetic variation, gene expression and disease, D'Antonio et al. showed that eQTL analyses of HLA haplotypes provided substantially greater statistical power than only using single variants 59 . They examined the association between AH8.1 and delayed colonisation in Cystic Fibrosis, and suggested that downregulation of RNF5 expression was the likely causal mechanism. Taken together, these pioneering eQTL studies incorporating HLA haplotypes are a powerful approach to identify causal genetic mechanisms underlying disease associations both inside and outside the MHC region.
In this regard, we recently developed a new RNA-sequencing method to capture differential allele-level expression and genotypes of all the classical HLA loci and haplotypes in the Japanese population for further in-depth studies of graft rejection after transplantation and HLA-related diseases 28 .
Structural variants: indels and transposable elements in MHC genomic evolution and regulation of expression. The human MHC structural variants and indels have received far less attention than SNPs and minor variants with respect to health and disease. In comparative genomic analyses between different MHC haplotypes, the indel diversity is two to seven times greater than SNP diversity 53,62 . Structural variants and indels have a potential gain and loss of functions that can affect phenotypes, susceptibility and resistance to disease via many different molecular, cellular and pathogenic independent and interrelated mechanisms. Figure 3 shows a ~55-kb deletion within the alpha block of a haplotype with HLA-A*24:02 13 that has the highest allele frequency of 35.6% in the Japanese population (http://hla.or.jp/med/frequency_search/en/allele/). HLA-A*24:02:01 apparently has a protective effect against Stevens-Johnson syndrome (SJS) and toxic epidermal necrolysis (TEN), which are life-threatening acute inflammatory vesiculobullous reactions of the skin and mucous membranes 63 .
Transposable elements (TEs) have important, albeit often poorly defined, roles in generating haplotypes via recombination mechanisms such as integration (insertion), duplication, rearrangements, deletions and gene conversion 64,65 . TEs and other repeat sequences appear to have been integral in the generation of MHC segmental duplications of the class I and class II regions 6,66 , and of different haplotypes, mainly by acting both as recombination acceptor and suppression sequence regions for DNA binding Rec proteins and enzymes such as PRDM9 depending on their genomic distribution, sequence conservation or diversity, and evolutionary age of integration and transposition 13,14 . The association of particular TEs and repeats with MHC segmental duplications was reported previously for the genomic structural organisation of MHC duplicated genes in humans 6 , chimpanzees 38,62 and rhesus macaques 67 . Both old and young Alu insertions generate point mutations, microsatellites and SNPs within the flanking regions of the insertion sites 68 . TEs such as Alu, SVA, HERVs and LTR have been used as genetic markers to estimate the evolutionary age of MHC gene duplication events and for discerning the evolutionary interrelationships between different human haplotypes 54,66,69 . For example, ten young AluY indels that are either present or absent in particular human MHC class I and class II haplotypes are useful evolutionary genetic markers of past recombination events, as well as excellent markers for elucidating population phylogenetics and genetic interrelationships [70][71][72] . In this regard, Cun et al. recently showed that five different MHC class II dimorphic Alu elements either alone or linked together as haplotypes with HLA-DRB1 alleles can differentiate 12 Chinese minority ethnic groups according to their geographic locations, and correlate them with their population characteristics of language family, migration and sociality 73 .
TE insertions within the MHC genomic region might act like surgical sutures or band-aids that help to repair and rejoin double-strand DNA breaks during recombination events 41 , such as those involved with the 'mismatch repair system' or via various other repair mechanisms of damaged DNA 17 . In this regard, it seems that TEs like Alu, L1, SVA and LTR elements are intimately involved with recombination and DNA repair, and that they also contribute to nucleotide point mutations between different sequences 6,13,41 . Moreover, some of these TE indels have been strongly associated with the regulation of gene expression and disease 74,75 . Much work is needed to characterise which MHC TEs have contributed to past recombination events, affect gene expression, and have a role in MHC-related diseases and in the various important traits and phenotypes associated with pathogen defence.

Population MHC haplotypes
Although homozygous cell-lines can provide phased genomic sequences for analysis of haplotypic structures, population studies are necessary for information about the frequency and distribution of the MHC haplotypes and their association with disease, and for obtaining cross-matching data for organ and cell transplantations. Most frequency data of population MHC haplotypes are based on genotyping HLA alleles of heterozygotes and applying statistical and computational methods, such as the expectation-maximisation algorithm or LD values of non-random, multi-allelic correlations between pairs of loci, to estimate the correct phase of the haplotypes 76 . The LD statistical analysis of heterozygotes might be reasonably accurate for estimating high frequency or common haplotypes, but the reliability decreases for low frequency or minor haplotypes. Confounders to haplotype estimations include typing ambiguity, sample size, incompleteness of HLA data, allele frequency errors, recombination and especially unknown gamete phase.
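The phase-estimation idea mentioned above can be illustrated with a minimal sketch of an expectation-maximisation (EM) estimate of two-locus haplotype frequencies from unphased genotypes. This is a generic textbook formulation under an assumption of random mating, not the specific method of the cited study; the function names (`phase_options`, `em_haplotype_frequencies`) and the three-person toy data set are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def phase_options(geno_a, geno_b):
    """Return the distinct haplotype pairs compatible with one unphased
    two-locus genotype, e.g. geno_a = ("A*01:01", "A*03:01")."""
    a1, a2 = geno_a
    b1, b2 = geno_b
    # A double heterozygote has two possible phases; every other
    # genotype collapses to a single haplotype pair.
    options = {
        tuple(sorted([(a1, b1), (a2, b2)])),
        tuple(sorted([(a1, b2), (a2, b1)])),
    }
    return sorted(options)


def em_haplotype_frequencies(genotypes, n_iter=100, tol=1e-9):
    """Estimate haplotype frequencies by EM, assuming random mating."""
    options = [phase_options(ga, gb) for ga, gb in genotypes]
    haplotypes = sorted({h for opts in options for pair in opts for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    n_chrom = 2 * len(genotypes)

    for _ in range(n_iter):
        counts = defaultdict(float)
        for opts in options:
            # E-step: probability of each phase given current frequencies
            # (2*f1*f2 for two different haplotypes, f1*f1 for identical ones).
            weights = [freq[h1] * freq[h2] * (1 if h1 == h2 else 2)
                       for h1, h2 in opts]
            total = sum(weights)
            if total == 0:  # guard against numerical underflow
                weights = [1.0] * len(opts)
                total = float(len(opts))
            for (h1, h2), w in zip(opts, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts become the new frequencies.
        new_freq = {h: counts[h] / n_chrom for h in haplotypes}
        delta = max(abs(new_freq[h] - freq[h]) for h in haplotypes)
        freq = new_freq
        if delta < tol:
            break
    return freq


# Hypothetical toy data: (HLA-A genotype, HLA-B genotype) per individual.
toy_genotypes = [
    (("A*01:01", "A*01:01"), ("B*08:01", "B*08:01")),  # phase known
    (("A*01:01", "A*03:01"), ("B*08:01", "B*07:02")),  # phase ambiguous
    (("A*03:01", "A*03:01"), ("B*07:02", "B*07:02")),  # phase known
]

if __name__ == "__main__":
    for hap, f in sorted(em_haplotype_frequencies(toy_genotypes).items()):
        print(hap, round(f, 4))
```

In this toy data set the two homozygous individuals pull the ambiguous double heterozygote towards the A\*01:01-B\*08:01 and A\*03:01-B\*07:02 phase, so the estimated frequencies of these two haplotypes approach 0.5 each. With few supporting observations the same procedure gives unstable estimates, which is consistent with the lower reliability noted above for minor haplotypes.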
A number of family-based population studies were published in the 1980s and 1990s on extended MHC haplotype frequencies for Caucasians in Australia 77 , and the United States 78 , as well as for American non-dominant European Caucasian and non-Caucasian or admixed Caucasian/non-Caucasians 18 . Since then, the HLA haplotype frequencies have been determined for many more worldwide populations 79,80 , and ethnic groups using pedigrees or statistical inference (http://www.allelefrequencies.net/default.asp). Table 5 lists examples of the six most common HLA haplotype frequencies for Japanese, Chinese, Saudi, British Caucasians, European Americans (Caucasians) and African Americans deduced by LD inference or segregation by pedigree analysis. Although we used the British Caucasian population as an example of the common European haplotypes such as AH7.1, AH8.1 and AH44.1 (Table 5), the HLA haplotype frequencies vary markedly among populations across the European continent 80 . According to Dawkins and Lloyd 46 , the five most common MHC AH haplotypes (at five HLA loci) in Australian Europeans living in Perth, Western Australia are AH8.1 (13.2%), AH7.1 (12.9%), AH44.1 (5.5%), AH44.2 (2.6%) and AH57.1 (2.6%), frequencies which tend to reveal a large immigratory bias towards their British ancestors (Table 5).
The conserved or fixed haplotypes that have little diversity and no evidence of recombination within their genomic sequences, such as AH7.1 or AH8.1 of Caucasian individuals (Table 5), can be studied and described as 'identity by descent' (IBD) haplotypes 81 , which are distinct from 'identity by state' (IBS) haplotypes, that is, those that have emerged by convergence. The highly conserved haplotypes that are shared between generations (haplotype sharing) might remain fixed or frozen over long periods of evolutionary time because of founder effects and population bottlenecks 82 , as well as efficient DNA repair mechanisms, negative population selection, or as yet unknown mutation inhibitory mechanisms. To what degree are conserved haplotypes frozen or fixed? Although this question is not resolved fully, available data suggest that many inherited haplotypes are not completely identical and that de novo mutations (SNPs and/or indels) do exist between the same conserved haplotypes in MHC genomic sequence comparisons [83][84][85][86] . The identification of variants between the same haplotypes might be important for assisting with optimal donor-recipient selection for allogeneic stem cell transplantation and for reducing acute and chronic graft-versus-host disease 26 .
On the other hand, heterozygous haplotypes or those that are very different between individuals (e.g., AH7.1 and AH8.1) are likely to have been inherited by an interplay of various genetic and population evolutionary processes including recombination, positive selection of benign mutations or SNPs, gene flow, genetic drift, frequency-dependent selection, admixture and trans-speciation over long periods of evolution 15,16,80 . For example, the known MHC class I haplotype sequences of Japanese, Africans, Asians, Arabs and Europeans generally are all different from each other in phylogenetic analyses 86,87 . Despite haplotype sharing of high frequency conserved polymorphic sequences by IBD such as those for AH8.1 or AH7.1 10,52 , most haplotypes among Europeans and other populations (Table 5) generally are markedly different in structure, organisation and frequency as a consequence of various hypothetical genetic and population evolutionary processes 80 .

The AH nomenclature is taken from Dorak et al. 47 . The 'x' after the AH B allele is an unknown sequential number that needs to be updated.