Introduction

The sequencing of the entire human genome in 2003, after a 13-year process, provided important insights into our complete genomic makeup.1 Subsequently, the genome sequences of J. Watson and of African, Chinese, Korean and Japanese individuals were reported.2, 3, 4, 5, 6 Analyses of the personal genomes of individuals have provided information on human genetic variation and complexity. Additionally, rapid progress in next-generation sequencing (NGS) technology has led to revolutionary changes in medical genomics, supplying massive sequencing data for human samples. Indeed, the 1000 Genomes Project has already reported novel variants, both rare and common, from population-scale sequencing.7

Various study designs have been applied to NGS, including DNA target resequencing, RNA sequencing for transcriptome analysis, chromatin immunoprecipitation sequencing, bisulfite sequencing for methylome analysis and others. The Encyclopedia of DNA Elements (ENCODE) project has examined the role of 99% of non-protein-coding DNA,8 revealing substantial interactions between proteins and DNA and the transcription of functional elements other than protein-coding mRNA. Moreover, various types of NGS technologies have been developed, including smaller-scale benchtop and long-read NGS systems. Benchtop NGS systems, such as GS Junior, Ion PGM and MiSeq, allow researchers to make fine adjustments for various smaller-scale studies. For example, some panels of focused target genes, such as genes related to cancer and inherited diseases, are now available for sequencing.

Human leukocyte antigen (HLA) genes have a long research history as important targets in biomedical science and treatment. The HLA region on chromosome 6p21 comprises six classical HLA genes and at least 132 protein-coding genes. This region has important roles in regulation of the immune system as well as fundamental molecular and cellular processes.9 The sequencing of a continuous 3.6-Mb HLA genomic region with annotation of 224 genes was reported by the MHC Sequencing Consortium in 1999.10 In addition, the MHC Haplotype Project was carried out between 2000 and 2006 by the Sanger Institute, providing genomic sequences and gene annotations of eight different HLA-homozygous haplotypes to build a framework and resource for association studies of all HLA-linked diseases; these haplotypes were registered as UCSC hg19 or NCBI GRCh37 reference assemblies.11, 12, 13 This small segment of 3.6 Mb occupies only 0.13% of the human genome but is associated with more than 100 diseases, mostly autoimmune diseases such as diabetes, rheumatoid arthritis, psoriasis and asthma. Furthermore, specific alleles of the HLA genes are strongly associated with hypersensitivities to specific drugs. For example, strong associations between carbamazepine-induced Stevens-Johnson syndrome or toxic epidermal necrolysis and HLA-B*15:02,14, 15 abacavir-induced hypersensitivity and HLA-B*57:01,16, 17, 18, 19 and allopurinol-induced Stevens-Johnson syndrome or toxic epidermal necrolysis and HLA-B*58:01 (ref. 20) have been reported in various populations. For a better understanding of disease causality and the adverse effects of drugs, the haplotype structure of the HLA region should be extensively and unambiguously determined. Therefore, a specific analytical procedure should be developed for completion of HLA sequencing and haplotype determination. NGS technologies have a potential advantage over the Sanger method in the sequencing of HLA genes: sequences that preserve haplotype structure can be obtained at high throughput.

To date, several high-throughput HLA-typing methods using NGS have been developed.21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42 Importantly, HLA typing using NGS provides both high-throughput and high-resolution capabilities (Figure 1). Additionally, as reported by the ENCODE Project, HLA gene sequencing alone is not sufficient for developing a complete understanding of the genetic makeup of the HLA locus. The expression levels of HLA genes can have crucial roles in the pathogenesis of diseases; thus, detection of regulatory single-nucleotide variants (SNVs) and insertions and deletions (Indels) located outside of exons is necessary. If phase-defined complete sequencing of HLA genes, including functional regulatory regions, is performed, novel alleles associated with disease risks and adverse effects of drugs could be obtained, and the expression levels of genes that affect biological processes could be clarified.

Figure 1

HLA typing to provide sequencing data for the HLA gene(s) and regions. NGS-derived HLA sequencing data can be analyzed from various points of view. The minimum scope of polymorphism is the genotype of a single SNV, and the maximum scope is the HLA haplotype sequence as a set of alleles from each HLA gene. The phase-determined sequence of an HLA allele can be used as a reference for HLA typing. The resolution of HLA typing is classified into the following four categories: two-digit for allele groups, four-digit for specific HLA proteins, six-digit for specific HLA coding sequences (CDS) and eight-digit for specific HLA genomic sequences including untranslated regions and introns.

PCR-based HLA sequencing using NGS

PCR-based methods, which involve an amplicon sequencing step, and sequence capture methods are both commonly used for library preparation. Most NGS-based HLA-typing methods have been developed using the PCR-based approach. In 2009, two HLA-typing methods using a Roche GS FLX system were reported (Table 1).21, 22 These first NGS-based HLA-typing methods focused only on key exons, which have commonly been analyzed using sequence-specific oligonucleotide-primed PCR (PCR-SSO) with fluorescent beads and PCR-sequencing-based typing using direct sequencing. Additionally, various PCR designs, such as long PCR and reverse transcription-PCR, have been applied for NGS HLA typing.21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42 These PCR-based HLA-typing methods differ primarily in primer design and the type of sequencer (Figure 2a). In particular, the long PCR method enables sequencing of the entire HLA gene, including the introns, untranslated regions, and upstream and downstream regions, thus realizing high-resolution and high-throughput HLA typing. Importantly, HLA typing should be carried out by determining complete HLA gene sequences through the physical determination of DNA sequences, rather than by HLA-type imputation or estimation based on the IMGT/HLA database. Indeed, the phase-defined sequencing method includes an HLA-typing method as a part of the pipeline for determination of complete HLA gene sequences.33 However, some studies have shown that PCR dropout or allelic imbalance may occur during the PCR step; these issues are unpredictable and tedious to resolve. Several companies have recently released NGS HLA-typing kits based on long PCR products for library preparation; these kits include Illumina TruSight HLA, One Lambda NXType, GenDX NGS-go AmpX and Omixon Holotype HLA. Using these kits, 11 (HLA-A, -C, -B, -DRB3, -DRB4, -DRB5, -DRB1, -DQA1, -DQB1, -DPA1 and -DPB1), 8 (HLA-A, -C, -B, -DRB1, -DQA1, -DQB1, -DPA1 and -DPB1), 5 (HLA-A, -C, -B, -DRB1 and -DQB1) and 5 (HLA-A, -C, -B, -DRB1 and -DQB1) genes have been amplified, respectively.

Table 1 PCR-based HLA typing using NGS
Figure 2

Preparation of HLA gene fragments for the DNA library. DNA fragments of the HLA genes are prepared by PCR-based (a) or hybridization-based (b) methods. (a) Many publications describing PCR-based methods have used different PCR designs such as short PCR for target exons (blue bar) or long PCR for entire genes (red bar). After PCR amplification, the pooled PCR products are used for library preparation, with or without fragmentation, to add adapters, with or without indexes, for each sequencer. In the PCR-based method, the first step is PCR for HLA genes and the second step is library preparation. (b) The sequence capture method based on hybridization is also commonly used to enrich HLA gene fragments. Biotinylated DNA/RNA probes carrying HLA gene sequences are hybridized to the DNA library, which includes the HLA gene sequences. The probe-bound library fragments are collected using streptavidin magnetic beads and a magnet. In the sequence capture method, the first step is library preparation and the second step is enrichment for HLA genes. (c) After sequencing, HLA gene sequences of each individual are reconstructed by alignment to reference HLA gene sequences. The consensus sequences constructed from the aligned reads are searched against the IMGT/HLA database for specific HLA alleles. In the NGS-based HLA-typing method, the basic data analysis approach is similar between PCR-based and sequence capture methods.

The capture method for HLA sequencing

Target resequencing of the HLA genes using the sequence capture method has not been well developed compared with PCR-based HLA typing. The sequence capture method is based on hybridization between DNA of an adapter-ligated library and a biotinylated DNA/RNA probe designed based on target sequences of genes or the genomic region (Figure 2b).43 Hybridized DNA fragments are enriched for the target region using streptavidin magnetic beads. Wittig and colleagues44 reported the first automated HLA-typing method based on the sequence capture technology. This method uses targeted capturing of the classical class I (HLA-A, HLA-C and HLA-B) and class II (HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DPA1 and HLA-DPB1) HLA genes. The DNA fragments from these eight HLA genes can be simultaneously enriched by a hybridization reaction in a single tube, without allele dropout, which is frequently observed in PCR-based methods. The results showed high accuracy for allele call (99%) and identified errors in the IMGT/HLA reference database. It is also notable that the sequence capture method is generally applicable for NGS-based target resequencing of larger genomic regions and a larger number of genes than the PCR-based methods. On the basis of these features and the findings from the automated NGS-typing method, the sequence capture method has major advantages over PCR-based methods and is a promising method for HLA sequencing.

Sequencers for HLA gene sequencing

Various sequencing machines have been used for NGS HLA typing. The majority of published methods have been established using Roche sequencers. However, for the last few years, the Illumina MiSeq instrument has also been used for HLA typing (Table 1). The NGS platforms used for HLA typing are likely to keep changing along with improvements in NGS technologies. The Pacific Biosciences PacBio RS II sequencer, which is capable of generating very long reads from a single molecule using single-molecule real-time (SMRT) sequencing, was recently applied to HLA typing. SMRT sequencing is highly effective in generating accurate, phased sequences of full-length alleles of HLA genes. Complete phasing of the HLA genes by SMRT sequencing may resolve phase ambiguity, which is a fundamental problem of conventional HLA-typing methods.
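The phase ambiguity mentioned above can be illustrated with a minimal sketch. It is not taken from any of the cited typing pipelines; the function names and the two-site toy genotype are hypothetical. The sketch enumerates every pair of haplotypes that is consistent with unphased genotypes, and then shows how a single read spanning all sites (as a long-read platform might provide) narrows the candidates to one pair.

```python
from itertools import product


def consistent_haplotype_pairs(genotypes):
    """Enumerate unordered haplotype pairs consistent with unphased genotypes.

    genotypes: one 2-tuple of bases per variant site, e.g. [("A", "G"), ("C", "T")].
    """
    pairs = set()
    # For every site, choose which of the two observed bases sits on haplotype 1.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


def resolve_with_spanning_read(pairs, read_haplotype):
    """Keep only the pairs that contain the haplotype seen on one spanning read."""
    return [pair for pair in pairs if read_haplotype in pair]


# Hypothetical example: two heterozygous sites observed without phase information.
unphased = [("A", "G"), ("C", "T")]
candidates = consistent_haplotype_pairs(unphased)
print(candidates)                                    # two candidate pairs remain
print(resolve_with_spanning_read(candidates, "AT"))  # one pair remains
```

With two heterozygous sites there are two candidate pairs, and the number doubles with each additional heterozygous site, which is why short, unphased observations can leave an HLA genotype ambiguous.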

HLA typing from WGS and WES data

During the past few years, whole exome sequencing (WES) has identified the causes of a large number of Mendelian diseases by analyzing familial samples and/or sporadic patients.45 In addition, WES has facilitated the acquisition of massive amounts of data in various genome sequencing projects, such as the 1000 Genomes Project,7 NHLBI GO Exome Sequencing Project (https://esp.gs.washington.edu/drupal/) and UK10K project (http://www.uk10k.org), which are expected to improve our understanding of variations in the human genome. The sequence capture method has also been applied for WES using various kits, such as the Agilent Human All Exon kit, Roche SeqCap EZ Human Exome kit and Illumina TruSeq Exome Enrichment kit. The respective libraries of capture oligo-probes, which cover all human exons, are designed to target all exons of all HLA genes. For example, 820 exons of 182 genes in the HLA region are found in the Agilent Human All Exon kit design. Because DNA sequence reads for HLA genes are included in whole genome sequencing (WGS) and WES datasets, HLA typing could be carried out using both types of dataset. Within WGS and WES datasets, HLA gene sequences represent only a small portion of the data, but these sequences are phased HLA gene sequences, that is, HLA alleles. Therefore, HLA typing from WGS or WES datasets could become a key analysis method, with the potential for higher accuracy than existing PCR-SSO or PCR-sequencing-based typing results.

NGS HLA-typing software

As described in Table 2, various HLA-typing software programs, including Omixon Target HLA and HLAminer, as well as other academic and commercial software packages, have been developed for HLA typing from various types of data, including WGS, WES, RNA sequencing and amplicon datasets.46, 47, 48, 49, 50, 51, 52, 53, 54 An example of HLA-typing software for WGS or WES, including a brief overview of the Omixon Target HLA typing system, is described in Figure 3. For statistical methods, sequence reads are first aligned to the whole IMGT/HLA database (all known HLA alleles). Then, the best matching alleles are selected based on various alignment statistics, such as the number of reads covering exons and the extent of exons covered. During statistical analysis, only reads that are mappable as homologous to any allele in the IMGT/HLA database with a low number of mismatches should be stored. In the Omixon publication, which used data from the 1000 Genomes Project, the concordance rate between the NGS-based method and PCR-SSO was around 90%, which was not considered high.54 For the analysis, sequence reads from all exons of HLA genes were used. At least 10 reads are required to counterbalance random noise. However, the sequence reads were not evenly distributed across each gene region, and the average depth implied that there may be holes in coverage. Another publication, in which the authors used the HLAminer software, also reported a 92.8% concordance rate between these two methods for allele group prediction.46 NGS HLA typing can call alleles for all the HLA genes recorded in the IMGT/HLA database and can also detect novel HLA alleles. On the other hand, it is not currently possible to detect rare HLA alleles by PCR-SSO. Therefore, if sequence reads of rare HLA alleles are present in WGS or WES data, the HLA-typing results from NGS would be expected to be discordant with those from PCR-SSO. In the near future, the reliability of these HLA-typing methods from WGS and WES data may be improved. These programs are the next step in developing methods with greater specificity and sensitivity of HLA-typing results. In particular, the specificity depends on the HLA-typing resolution, that is, two-, four-, six- or eight-digit, which correspond to the allele group, the specific allele protein, the specific DNA sequence with synonymous substitutions in the coding region and the specific DNA sequence of the entire gene, respectively. The high-resolution HLA typing of NGS is advantageous compared with the existing PCR-SSO and PCR-sequencing-based typing methods. In practice, it is not yet possible to execute complete eight-digit HLA typing because of limitations in the number of known HLA allelic sequences deposited in the IMGT/HLA database, where most HLA allelic sequences have been recorded as coding sequences or partial exons. Only HLA alleles recorded as full-length HLA gene sequences can be used for allele calls with eight-digit resolution. To put eight-digit resolution typing into practice, the NGS-based phase-defined complete sequencing methods for the HLA genes will be applicable as a high-resolution tool for the detection of novel alleles, and will facilitate the development of expanded databases with full-length HLA allelic sequences for eight-digit HLA typing.33 The success of complete HLA gene sequencing with high accuracy depends on high sequence read depth. In the case of HLA-B sequencing, the minimum requirement for complete phasing was an average depth of approximately 800×.37
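The relationship between typing resolution and concordance can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used in the cited comparisons: the helper names are hypothetical, the sample calls are invented, and allele names with expression suffixes (such as a trailing N or L) are not handled. Each colon-separated field of an allele name corresponds to two digits of resolution, so truncating to two digits compares allele groups and truncating to four digits compares specific proteins.

```python
def truncate_allele(name, digits):
    """Truncate an HLA allele name to two-, four-, six- or eight-digit resolution."""
    if digits not in (2, 4, 6, 8):
        raise ValueError("digits must be 2, 4, 6 or 8")
    gene, _, fields = name.partition("*")
    # Each colon-separated field carries two digits of resolution.
    kept = fields.split(":")[: digits // 2]
    return gene + "*" + ":".join(kept)


def concordance(calls_a, calls_b, digits):
    """Fraction of shared samples whose allele pairs agree at the given resolution."""
    shared = sorted(set(calls_a) & set(calls_b))
    if not shared:
        raise ValueError("no shared samples")
    agree = 0
    for sample in shared:
        a = sorted(truncate_allele(x, digits) for x in calls_a[sample])
        b = sorted(truncate_allele(x, digits) for x in calls_b[sample])
        agree += a == b
    return agree / len(shared)


# Invented calls for two samples, for illustration only.
ngs_calls = {
    "sample1": ("HLA-B*15:02:01", "HLA-B*58:01:01"),
    "sample2": ("HLA-B*57:01:01", "HLA-B*57:03:01"),
}
sso_calls = {
    "sample1": ("HLA-B*58:01", "HLA-B*15:02"),
    "sample2": ("HLA-B*57:01", "HLA-B*57:02"),
}

print(truncate_allele("HLA-B*15:02:01", 4))
print(concordance(ngs_calls, sso_calls, 2))
print(concordance(ngs_calls, sso_calls, 4))
```

In this toy data the two methods agree for both samples at the allele-group level but for only one sample at four-digit resolution, which illustrates why a reported concordance rate is meaningful only together with the resolution at which it was computed.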

Table 2 HLA-typing software and category of acceptable reads
Figure 3

Overview of data analysis for HLA typing using WGS/WES. Massive sequence reads from WGS/WES are aligned to the whole IMGT/HLA database (all known HLA alleles) to search for best matching alleles based on alignment statistics, number of reads covering exons and the extent of exon coverage. The HLA allele can be identified by only storing reads that are mappable as homologous to any allele in the IMGT/HLA database with a low number of mismatches by statistical analysis.
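The allele-selection idea outlined in this overview can be sketched as follows. This is a deliberately simplified toy, not the algorithm of Omixon Target HLA or HLAminer: real tools align reads with mismatch tolerance against the IMGT/HLA database, whereas here a read counts only if it matches a candidate sequence exactly, the two candidate sequences are invented, and treating the 10-read requirement as a per-allele minimum is an assumption.

```python
def rank_alleles(reads, alleles, min_reads=10):
    """Rank candidate alleles by covered fraction, then by supporting read count.

    reads: iterable of read sequences.
    alleles: dict mapping allele name to reference sequence (toy stand-in
             for database entries).
    Returns a list of (name, covered_fraction, n_reads), best first.
    """
    results = []
    for name, seq in alleles.items():
        covered = [False] * len(seq)
        n_reads = 0
        for read in reads:
            start = seq.find(read)  # exact match only (zero mismatches)
            if start == -1:
                continue  # read is not stored for this allele
            n_reads += 1
            for i in range(start, start + len(read)):
                covered[i] = True
        # Assumed interpretation: discard alleles supported by too few reads.
        if n_reads >= min_reads:
            results.append((name, sum(covered) / len(seq), n_reads))
    return sorted(results, key=lambda r: (r[1], r[2]), reverse=True)


# Two invented 20-bp candidate sequences that differ at a single position.
toy_alleles = {
    "allele_1": "ACGTTGCAAGGCTTACGATC",
    "allele_2": "ACGTTGCAAGACTTACGATC",
}
# Simulated error-free 8-bp reads tiled across allele_1.
source = toy_alleles["allele_1"]
toy_reads = [source[i:i + 8] for i in range(len(source) - 7)]

print(rank_alleles(toy_reads, toy_alleles))               # default 10-read minimum
print(rank_alleles(toy_reads, toy_alleles, min_reads=1))  # no effective minimum
```

Reads overlapping the differing position fail to match the second candidate, so it receives fewer supporting reads and a gap in coverage. The same reasoning explains why uneven read distribution and holes in coverage can reduce the reliability of allele calls from WGS or WES data.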

HLA in the ENCODE project data

In HLA research, NGS technologies have influenced HLA typing as well as our understanding of the functional regulatory regions within the HLA region, which could affect the expression of HLA genes. Thus far, HLA-associated diseases have been understood under the hypothesis that antigen presentation by HLA molecules affects the immune system. Four-digit HLA typing is sufficient to explain the importance of antigen presentation in disease causality. On the other hand, HLA-associated phenotypes could also be affected by the expression levels of HLA genes or by allelic imbalance. In 2012, the ENCODE project succeeded in systematically mapping transcribed regions and transcription factor (TF)-binding sites in the genome, and showed the genomic patterns of chromatin structure and histone modifications.8 The achievements of the project also include the discovery of putative functional elements and domains within the HLA region. The knowledge obtained in the ENCODE project could be extended to understand HLA-associated diseases and phenotypes. Certain diseases are associated with specific HLA alleles, and many variants within the HLA genes are in linkage disequilibrium with those alleles; therefore, it is quite difficult to determine genetically which variant is responsible, because the disease is associated with a haplotype carrying both the HLA alleles and many other variants. Furthermore, genetic analyses are limited when sample numbers are small and genetic effects are minor; however, the ENCODE project highlights the functional regions of the entire human genome, including the HLA region.

Two examples of HLA genes and the associated diseases are described in Figure 4, Table 3 and Table 4. HLA-DRA is less polymorphic than HLA-A, -C, -B or -DRB1. However, many variants have been observed in the upstream region among HLA-DRA alleles, particularly among sequences sharing the same six-digit allele name, HLA-DRA*01:01:01 (Figure 4, unpublished data). A deletion of about 2 kb was also detected in the upstream region of HLA-DRA*01:02:02 in HLA target resequencing data. Before the completion of the ENCODE project, it was difficult to understand the effects of such deletions. Now, we can see the possibility of a functional regulatory region around the deletion. Interestingly, two haploid genome sequences of HLA-DRA*01:01:01 had different sequences within the intron and upstream regions. Some of the variants may also affect the expression levels of HLA-DRA by affecting TF binding at the variant sites. The haplotype of the variants in the upstream region could be significantly different, even though the HLA allele was found to be the same as the HLA-DRA*01:01:01 sequence with six-digit resolution. The ENCODE project stressed the importance of complete HLA gene sequencing, including the upstream regulatory region, to determine the haplotype. In total, 3619 SNVs in the HLA region were selected as expression quantitative trait loci (eQTL) SNVs for HLA gene expression (Table 3).55 These eQTL SNVs were identified in the RegulomeDB database (http://www.regulomedb.org), which provides annotations of SNPs with known and predicted regulatory elements in the intergenic regions of the human genome. The database includes public datasets from the ENCODE project, in addition to GEO and publications. Known and predicted regulatory DNA elements from DNase hypersensitivity, TF-binding sites and promoter regions that have been biochemically characterized to regulate transcription are also included. Recorded variants have been classified into various categories according to TF binding and target gene expression. The 3619 HLA eQTL SNVs are likely to affect the binding of TFs that mediate expression of the HLA genes. For variants and deletions near HLA-DRA, new hypotheses concerning the biological functions of the gene could be generated to improve our understanding of HLA-DRA-associated phenotypes.

Figure 4

Example of target resequencing to detect variants and functional prediction of the regulatory region. Target resequencing of the HLA region clarifies all variants in the target region. For example, several variants and a deletion of approximately 2 kb have been detected in the upstream region of HLA-DRA. (a) Alignment view of mapped reads (pink: forward strand read, purple: reverse strand read) in the alignment track for detection of SNVs (A: green, C: blue, G: yellow, T: red) and the deletion as displayed in the coverage track. (b) The region was located in cis-regulatory elements, namely active (H3K27ac-marked) enhancers and a DNase I-hypersensitive site, defined by ENCODE chromatin immunoprecipitation sequencing and DNaseI-seq peaks. The deletion and SNVs may affect the expression level of HLA-DRA by influencing the binding of TFs.

Table 3 Number of eQTL SNVs linking expression level of HLA genes
Table 4 Lead SNVs linking rheumatoid arthritis association with regulatory information in the human genome

In another example, 12 SNVs with regulatory functions have been shown to be associated with rheumatoid arthritis (Table 4). Of the lead SNVs, rs660895 is located 32.6 kb upstream of HLA-DRB1 (from the nearest transcription start site) and has been described as a tag SNP for the HLA-DRB1*04:01 allele. The HLA-DRB1*04:01 allele has been shown to be associated with a higher risk of rheumatoid arthritis (OR: 6.2).56 Chromatin immunoprecipitation sequencing and DNaseI-seq peak data from the ENCODE project showed that this SNP is located within regulatory regions. Eight other SNVs have been shown to be located within TF-binding sites predicted by in silico motif discovery. The information from ENCODE data will help decision-making for additional and follow-up experiments to obtain reliable evidence for the mechanism through which SNVs in the HLA region contribute to the development of rheumatoid arthritis.

Future directions

In 2005, Roche launched the first NGS instrument, the Genome Sequencer 20. The Genome Sequencer 20 was able to achieve a read length of about 100 bp and could sequence 20 Mbp per run. Within the last decade, rapid progress in NGS technology has resulted in revolutionary changes in medical genomics for applications in genetic diagnosis, called clinical sequencing or medical exome sequencing. However, the two commonly used methods for HLA typing, PCR-SSO and PCR-sequencing-based typing, have remained the first-line methods in HLA research and diagnosis for more than 10 years. Recently, several manufacturers have begun to develop HLA-typing kits for NGS; thus, elucidation of complete HLA gene sequences will soon provide new knowledge that will be useful for medical science. However, gene sequences of the HLA region alone will be insufficient for a complete understanding of HLA and all of the HLA-associated phenomena. For this purpose, phase-defined sequencing and haplotype determination of all regions, including the HLA genes and regulatory sequences in the HLA region, are essential. Further analyses will be required to determine the transcription of the fundamental ‘HLA’ unit, including the HLA genes and all associated targets involved in the HLA functional pathway, along with physically interacting targets and regulatory regions containing TF-binding sites. These must all be considered carefully to develop a complete understanding of ‘HLA’, that is, HLA-omics analysis. Finally, the goal of HLA typing as complete gene sequencing should be clinical applications that will benefit patients. Future HLA-typing methods will help realize the goal of ‘precision medicine’ by determining biologically distinct subgroups for precisely targeted treatments.57