A deep learning method for HLA imputation and trans-ethnic MHC fine-mapping of type 1 diabetes

Conventional human leukocyte antigen (HLA) imputation methods lose accuracy for infrequent alleles, which is one of the factors that reduce the reliability of trans-ethnic major histocompatibility complex (MHC) fine-mapping, given the inter-ethnic heterogeneity in allele frequency spectra. We develop DEEP*HLA, a deep learning method for imputing HLA genotypes. In validation using the Japanese and European HLA reference panels (n = 1,118 and 5,122), DEEP*HLA achieves the highest accuracies, with significant superiority for low-frequency and rare alleles. DEEP*HLA is less dependent on distance-dependent linkage disequilibrium decay of the target alleles and might capture complicated region-wide information. We apply DEEP*HLA to type 1 diabetes GWAS data from BioBank Japan (n = 62,387) and UK Biobank (n = 354,459), and successfully disentangle independently associated class I and II HLA variants with shared risk among diverse populations (the top signal at amino acid position 71 of HLA-DRβ1; P = 7.5 × 10^−120). Our study illustrates the value of deep learning in genotype imputation and trans-ethnic MHC fine-mapping.


Supplementary Note 1. Extended evaluations of imputation accuracies of DEEP*HLA
To benchmark the accuracies of DEEP*HLA more comprehensively, we tested its performance in various aspects.

a. Effects of down-sampling on accuracy of DEEP*HLA
We evaluated the effects of down-sampling the training data on the accuracies of DEEP*HLA. We performed 10-fold cross-validation for both reference panels, and an evaluation on independent samples for our Japanese reference panel, in each case using only part of the original training fold for training. We tested down-sampling rates of 90, 80, 70, 60, 50, 40, 30, 20, and 10%. The results are shown in Supplementary Fig. 1.
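The down-sampled cross-validation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that individuals are dropped uniformly at random from each training fold and that validation folds are left untouched. The function names `kfold_indices` and `downsampled_splits` are hypothetical.

```python
import random


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and split them into n_folds disjoint folds."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]


def downsampled_splits(n_samples, rate, n_folds=10, seed=0):
    """Yield (train, valid) index lists for each fold.

    Only a fraction `rate` of the original training fold is kept
    (assumption: uniform random sampling without replacement).
    """
    folds = kfold_indices(n_samples, n_folds, seed)
    rng = random.Random(seed + 1)
    for k, valid in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        n_keep = max(1, round(rate * len(train)))
        yield rng.sample(train, n_keep), valid


# Example: the Japanese panel size (n = 1,118) at a 50% down-sampling rate.
splits = list(downsampled_splits(1118, rate=0.5))
```

Looping `rate` over 0.9, 0.8, ..., 0.1 would reproduce the grid of down-sampling rates listed above.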

b. Comparison with single-task neural networks, and multi-task neural networks with shuffled groupings
We evaluated the advantages of multi-task learning with grouping. Multi-task learning would be effective mainly in our Japanese reference panel, in which more HLA gene loci were genotyped than in the T1DGC panel; thus, we tested only our Japanese panel.

b.1. Comparison with single-task neural networks
We tested the performance of single-task neural networks that imputed all genes separately.
To perform a fair comparison, the input regions were set to be the same as those of DEEP*HLA with the original grouping. As shown in Supplementary Fig. 11a, the accuracies were lower than those of the multi-task DEEP*HLA across all ranges of allele frequency. Moreover, the mean training time in the cross-validation was 192 min per iteration, which was over 5 times longer than that of multi-task learning (36 min).

b.2. Comparison with multi-task neural networks with shuffled grouping
To evaluate the advantage of the original grouping, we evaluated the performance of models with shuffled groupings. We investigated two cases: (A) shuffling HLA genes between groups 1 and 2, and between groups 3 and 4; (B) shuffling HLA genes among groups 1, 2, 3, and 4. We tested 5 different groupings for each case. As shown in Supplementary Fig. 11b, DEEP*HLA with the original grouping significantly outperformed the models with shuffled groupings. The groupings in (A) tended to perform better than the groupings in (B). These results suggest the importance of grouping based on physical distance and LD structure.

c. Comparison among different input window sizes
We benchmarked DEEP*HLA with different window sizes of 250, 750, and 1,000 kb in addition to 500 kb (Supplementary Fig. 12). Although the optimal window size might vary by locus for rare alleles, there was no significant difference overall.
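As a rough illustration of how an input window determines which SNPs a model sees, the sketch below selects SNPs around a gene. It assumes that the window extends by the given size on each side of the gene boundaries, which the text above does not specify; the function `snps_in_window` and all coordinates are hypothetical.

```python
def snps_in_window(snp_positions, gene_start, gene_end, window_kb=500):
    """Return indices of SNPs within window_kb of the gene on either side.

    Assumption: the window is applied to each flank of the gene,
    i.e. [gene_start - window, gene_end + window].
    """
    flank = window_kb * 1000
    lo, hi = gene_start - flank, gene_end + flank
    return [i for i, pos in enumerate(snp_positions) if lo <= pos <= hi]


# Hypothetical SNP positions (bp) and gene coordinates, for illustration only.
positions = [29_000_000, 29_600_000, 29_900_000, 30_200_000, 31_000_000]
gene_start, gene_end = 29_910_000, 29_913_000

selected = {
    kb: snps_in_window(positions, gene_start, gene_end, window_kb=kb)
    for kb in (250, 500, 750, 1000)
}
```

Larger windows add more distant SNPs as inputs, which is the trade-off the window-size benchmark examines.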

d. Strict cross-validation including haplotype pre-phasing
For the cross-validation shown in the main text, the HLA reference panels were phased with all subjects together. In a real scenario, however, reference data (i.e. a training fold) and target data (i.e. a validation fold) are more likely to be phased independently. Thus, we conducted a stricter cross-validation of the accuracy of DEEP*HLA in which each training fold was pre-phased after separation. As shown in Supplementary Fig. 15, there were no significant overall changes in the accuracies, but there was a slight decline for alleles with a frequency < 0.5%. In particular, alleles with a frequency < 0.1% correspond to doubletons (or singletons) in the Japanese panel; thus, separate pre-phasing is likely to have a slight effect on imputation performance.

Supplementary Note 2. An illustration of accuracy metrics for imputed dosages used in our study
Based on a cross-tabulation table (Supplementary Fig. 14a), we defined a per-allele sensitivity of imputed dosage as

$$\mathrm{Se}(A) = \frac{TP}{TP + FN} = \frac{1}{m}\sum_{j=1}^{m} D_j(A)$$

where m denotes the number of true observations of allele A in the total sample, and D_j(A) represents the imputed dosage of allele A in the j-th haplotype that carries allele A. TP (true positive) and FN (false negative) are illustrated in the cross-tabulation table.
The accuracy of a locus, Acc, as defined in the SNP2HLA paper, is calculated by summing across all individuals the dosage of each true allele in the individual (i.e. the sum of true positives of all individual alleles), divided by the total number of observations. As shown in Supplementary Fig. 14b, it is consistent with a mean of the per-allele sensitivities weighted by allele frequency:

$$\mathrm{Acc}(L) = \frac{\sum_{i=1}^{n}\left[D_i(A1_{i,L}) + D_i(A2_{i,L})\right]}{2n} = \frac{\sum_{a} TP_a}{2n} = \sum_{a} \mathrm{freq}_a \cdot \frac{TP_a}{TP_a + FN_a}$$

where n denotes the number of individuals, D_i represents the imputed dosage of an allele in individual i, and alleles A1_{i,L} and A2_{i,L} represent the true HLA alleles for individual i at locus L. TP_a, FN_a, and freq_a denote the true positive, false negative, and allele frequency of an allele a. This is why we termed Acc a sensitivity for each locus.
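The relationship between the two metrics described above can be checked numerically. The following is a minimal sketch on toy data, assuming that imputed dosages are available per haplotype as a mapping from allele name to dosage; the allele names, dosage values, and function names are illustrative and not taken from the study.

```python
from collections import defaultdict


def per_allele_sensitivity(true_alleles, dosages):
    """Return ({allele: sensitivity}, {allele: m}) over all haplotypes.

    true_alleles: the true allele of each haplotype.
    dosages: one dict per haplotype, mapping allele -> imputed dosage.
    """
    tp = defaultdict(float)  # summed dosage assigned to the true allele
    m = defaultdict(int)     # number of true observations of each allele
    for allele, d in zip(true_alleles, dosages):
        tp[allele] += d.get(allele, 0.0)
        m[allele] += 1
    return {a: tp[a] / m[a] for a in m}, dict(m)


def locus_accuracy(true_alleles, dosages):
    """Sum of dosages of the true alleles divided by the number of haplotypes (2n)."""
    total = sum(d.get(a, 0.0) for a, d in zip(true_alleles, dosages))
    return total / len(true_alleles)


# Toy example: 3 individuals, i.e. 6 haplotypes (illustrative values only).
true_alleles = ["A*01", "A*01", "A*02", "A*02", "A*02", "A*24"]
dosages = [
    {"A*01": 0.9, "A*02": 0.1},
    {"A*01": 0.5, "A*24": 0.5},
    {"A*02": 1.0},
    {"A*02": 0.8, "A*01": 0.2},
    {"A*02": 0.6, "A*24": 0.4},
    {"A*24": 0.3, "A*02": 0.7},
]

sens, counts = per_allele_sensitivity(true_alleles, dosages)
acc = locus_accuracy(true_alleles, dosages)

# Mean of per-allele sensitivities weighted by allele frequency.
n_hap = len(true_alleles)
weighted = sum(counts[a] / n_hap * sens[a] for a in sens)
```

In this toy case `acc` and `weighted` agree up to floating-point error, because each allele's frequency equals its number of true observations divided by 2n.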

Supplementary Figure 3. Receiver operating characteristic curves for the ability of entropy-based uncertainty and genotype dosage to discriminate incorrectly imputed 4-digit alleles.
The entropy-based uncertainty discriminated incorrectly imputed 4-digit alleles more accurately than genotype dosage in both the Japanese panel (a) and the T1DGC panel (b). AUC, area under the curve.
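The kind of comparison summarized in this figure can be illustrated with a small sketch. It assumes that the uncertainty is the Shannon entropy of a predicted allele probability vector, which may differ from the exact definition used in the study, and it computes AUC with the rank-based (Mann-Whitney) formulation. All probabilities and labels below are toy values.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def auc(scores, labels):
    """Probability that a positive outscores a negative (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions for four haplotypes; label 1 marks an incorrect imputation.
probs = [
    [0.98, 0.01, 0.01],
    [0.50, 0.30, 0.20],
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
]
incorrect = [0, 1, 0, 1]

# Score 1: entropy of the full distribution (higher means more uncertain).
auc_entropy = auc([entropy(p) for p in probs], incorrect)
# Score 2: one minus the best-guess dosage (higher means less confident).
auc_dosage = auc([1.0 - max(p) for p in probs], incorrect)
```

Both scores separate the toy cases equally well; on real data, entropy can differ from the best-guess dosage because it accounts for how the remaining probability is spread across the other alleles.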

Supplementary Figure 5. Comparison of odds ratios of T1D risk-associated variants in HLA-DRB1 and -DQB1 between Japanese and Europeans.
Odds ratios (ORs) of the variants observed in our study (left) and of those reported previously (right) in HLA-DRB1 and HLA-DQB1 are plotted based on those in Japanese (horizontal axis) and Europeans (vertical axis).

For DEEP*HLA, the total processing time was determined by summing the time for phasing the GWAS data (with Eagle), the training time (153 min), and the imputation time. For HIBAG, the sum of the training time (3,273 min) and the imputation time was regarded as the total processing time.