HLA-A, -B, -C, -DRB1, -DQA1, and -DQB1 allele and haplotype frequencies defined by next generation sequencing in a population of East Croatia blood donors

Next-generation sequencing (NGS) is increasingly used in transplantation settings, but also as a method of choice for in-depth analysis of population-specific HLA genetic architecture and its linkage to various diseases. With respect to complex ethnic admixture characteristic for East Croatian population, we aimed to investigate class-I (HLA-A, -B, -C) and class-II (HLA-DRB1, -DQA1, -DQB1) HLA diversity at the highest, 4-field resolution level in 120 healthy, unrelated, blood donor volunteers. Genomic DNA was extracted and HLA genotypes of class I and DQA1 genes were defined in full-length, -DQB1 from intron 1 to 3′ UTR, and -DRB1 from intron 1 to intron 4 (Illumina MiSeq platform, Omixon Twin algorithms, IMGT/HLA release 3.30.0_5). Linkage disequilibrium statistics, Hardy-Weinberg departures, and haplotype frequencies were inferred by exact tests and iterative Expectation-Maximization algorithm using PyPop 0.7.0 and Arlequin v3.5.2.2 software. Our data provide first description of 4-field allele and haplotype frequencies in Croatian population, revealing 192 class-I and class-II alleles and extended haplotypic combinations not apparent from the existing 2-field HLA reports from Croatia. This established reference database complements current knowledge of HLA diversity and should prove useful in future population studies, transplantation settings, and disease-associated HLA screening.

in development of region specific HLA heritage 10 . The inter-and intra-population HLA comparisons made today are however, mostly constrained to a partial, exon description of HLA genetic variability (1 st and 2 nd fields of HLA nomenclature), whereas synonymous (3 rd field) and non-coding (4 th field) nucleotide variations remain largely unexplored. In such settings, it remains challenging to accurately build contiguous long haplotypes of uniformly high resolution even for the largest sample cohorts. The implementation of NGS technologies in HLA research and routine clinical work, enables however, elucidation of full-length HLA gene sequences, permitting an in-depth characterisation of population HLA diversity. In this context NGS can overcome limitations of traditional typing techniques, thereby sustaining optimal HLA matching of donor-recipient pairs for organ and particularly, haematopoietic stem cell transplantation (HSCT) 11 , improved estimates of population structure and HLA associated disease risk [12][13][14] , and better understanding of demographic history and geographic origin of a given population 15 .
Up to date, HLA allelic and haplotype diversity of a general Croatian population, irrespective of specific geographical preferences, was estimated in several large cohorts originating from Croatia [16][17][18][19] and emigrant population in Germany 20 , as well as few, more isolated populations from particular geographical locations such as island Krk 21 , island Hvar 22 , Istrian city of Rijeka 23 and Gorski Kotar 24 . Moreover, several non-frequent, rare and very rare HLA-A, -B and -DRB1 alleles and haplotypes have been characterized among the unrelated volunteer donors from the Croatian Bone Marrow Donor Registry (CBMDR) 25 . In addition, DPB1 allelic diversity was recently evaluated in 82 Croatian patients who underwent HSCT 26 . These previous studies were however, based on lower resolution (1 st and 2 nd field) HLA typing of selected HLA loci in individuals originating from various Croatian regions, providing partial insight into the HLA diversity of a general, but not East Croatia population.
The aim of the present study was thus to investigate and describe extended allelic and haplotype diversity of HLA-A, -B, -C, -DQA1, -DQB1 and DRB1 loci at high-resolution, 4 th field level, using high-throughput NGS technique for HLA typing of 120 healthy, unrelated blood donor volunteers from east Croatia.

Results
NGS sequencing results. We evaluated 120 donors (120 donors × 6 loci × 2 alleles = 1440 alleles), 9 of whom were excluded from further analysis due to low-performing samples on quality control check [low read count (≤2500 bp for class I and ≤5000 bp for class II genes) and/or low key exon coverage depth (≤30)]. The coverage of each locus in the remaining samples (111 donors, 1332 alleles) was calculated by Twin as the percentage of gene regions covered by reads compared to the whole allele sequence (coverage %). The Omixon Holotype primer positioning allowed only partial amplification of 3′UTRs and 5′UTRs, and covered exons 2-6 of DQB1 and exons 2-4 of DRB1 loci (Fig. 2).
Ambiguous allele calls, which remained unresolvable due to inherent product limitations of the Omixon Holotype Kit (missing data on SNPs or INDEL variations within the unsequenced 3′ UTR, 5′ UTR and intron 1/exon 1 regions), were reported as ambiguous (i.e. DQB1*06:01:01/15), or up to the third field level only (i.e. DQB1*05:03:01), and are enlisted in Supplementary Table 1. Nevertheless, all amplified exon/intron regions of each allele in the remaining samples were fully covered (detection %), with an average coverage depth of >140 reads per nucleotide position (Supplementary Table 2). On average, 54359 (Supplementary Table 3) high-quality reads were produced per sample, of which 91% were subsequently used for final consensus generation after removing noise reads and PCR crossover artefacts. The average fragment size was 259 bp, and the average read length 208 bp.
The HLA-DRB1~DQA1~DQB1 haplotype diversity is presented in Supplementary  Table 8), 10 appeared in three or more copies (Table 5)

Discussion
This study represents the first report of HLA diversity in an east Croatian population of healthy blood donors, as studied by the next generation sequencing of 6 HLA genes, providing extensive exon/intron coverage with minimum ambiguity. For the first time, the allele frequencies and the extended six gene haplotypes of Croats were examined at the 4-field resolution level and compared to the largest repository of HLA class I and class II data in Croatia, the Croatian bone marrow donor registry (CBMDR).
The comparison of HLA-A allele frequencies between our and CBMDR inventory did not reveal significant differences, and the ranking hierarchy of the most common A*02:01:01, A*01:01:01:01, A*03:01:01:01, and A*24:01:01:01 alleles was also the same. Greater differences in frequency rate between general and east Croatian population were, however, noticed among HLA-B allelic variants. The HLA-B*51:01 allelic group was the most frequent in both general and east Croatian cohort, but the frequency rank of the remaining HLA-B allelic variants was different, which was particularly evident for our 6 th ranking B*18:01 allele group, previously reported as the 2 nd most frequent allelic variant in the CBMDR (8.16%) 17 and one Croatian family study (8.27%) 18 . Among five different B*18 alleles detected in the Croats so far 28 , we observed only two, the B*18:01:01 (5.14%) and the B*18:03 (0.90%), resulting in a B*18:01 distribution closer to those reported in Armenians 29 , Germans 30 , Austrian and the Turkish minority in Germany 20 , Bulgarian Roma individuals 31 , and Iranian Kurds 32 . Moreover, the frequency of the 2 nd most common HLA-B allelic group in our cohort, the HLA-B* 35 www.nature.com/scientificreports www.nature.com/scientificreports/ 5 th in the CBMDR), was more similar to those observed in Turks 20 , Serbians 9 , and Italians 27 . A strong influence of south-eastern European populations on the HLA makeup of eastern Croats was further supported by the high prevalence of the HLA-B*27:02:01:01 variant, which fits well into the B*27:02 frequency gradient diminishing from the Middle East towards the Central and West European countries. In support, the observed B*27:02:01:01 frequency (3.6%) seems to be in close agreement with the B*27:02 cline extending across the south-eastern Tunisian (5.8%) 33 , Bulgarian (4.6%) 34 , CBMDR (2.14%) 17 , Czech (1.9%) 9 , and Polish (1.5%) 35 populations. The B*44:27:01 allele, an east European marker considered a rare variant according to the "Rare Alleles Detector" tool 9 , was also noticed in our cohort at a relatively high frequency (2.7%), contrasting observations from Croatian HSCT patients (1.18%) 19 , Czech National Marrow Donors Registry (0.69%) 9 , as well as Polish (0.8%) 35 , English (0.19%) 36 , and Argentinian blood donors (0.07%) 37 . As minor allele within the functionally identical B*44:02 G group, the B*44:27 variant has been reported at a relative ratio frequency of >5% among the B*44:02:01G-positive Bulgarian (36.82%), Hungarian (9.4%), Slovenian (25.60%) and Portuguese (6.17%) individuals 38 . In our sample, the B*44:27:01 relative ratio frequency within the B*44:02:01G-positive individuals (37.5%) sets the local estimate at the upper boundary.
Several ranking differences were detected between the general Croatian and our population at the HLA-C loci as well. The 1 st (C*07:01, AF = 21.77%), and the 2 nd (C*04:01, AF = 15.59%) highest ranking alleles from   www.nature.com/scientificreports www.nature.com/scientificreports/ the CBMDR inventory were ranked 2 nd (14.41%) and 1 st (18.01%) in east Croatian blood donors, respectively. The ranking sequence in our cohort continued with C*02:02:02 (9.46%), and C*07:02:01 (8.11%) allele group, which have been deemed 4 th and 5 th highest ranking alleles in the CBMRD. Overall, the ranking hierarchy of HLA-C allele groups in our sample was more similar to the one reported for Greece, and the Turkish minority in Germany 20 .
In conclusion, the present study provides an in-depth characterisation of HLA diversity in eastern Croats, revealing distinctive allele and haplotype detail consistent with the complex population history of the studied geographic region. The data complement and refine the existing estimates of HLA diversity in the Croatian population, increase population and geographic coverage by NGS data, and add granularity to clinically and genetically relevant HLA data. The study represents a useful reference for population and HLA-disease association studies; however, larger sample size and sequence coverage, particularly for the DQB1 and DRB1 genes, remain a prerequisite for the future studies.

Materials and Methods
Subjects. The study collection consisted of 120 healthy, unrelated, blood donor volunteers ( (Fig. 2). The HLA genotyping workflow was initiated by long-range PCR amplification of class I and class II HLA loci in a separate, sample and locus specific 25 µl reactions, comprising 2.5 µl of PCR buffer, 1.25 µl of dNTP mix, 2 µl of locus specific primers, 0.4 µl of LR PCR enzyme, and 5 µl of genomic DNA (≈30 ng/µl). Combined DQB1 enhancer (5.6 µl/sample) was added to the DQB1 master mix only. The conditions for class I gene amplification on Mastercycler nexus thermal cycler (Eppendorf, Hamburg, Germany) were set as follows: 95 °C for 3 min, followed by 35 cycles of 95 °C for 15 s, 65 °C for 30 s and 68 °C for 5 min, and a final incubation at 68 °C for 10 min. For class II genes, the conditions were: 95 °C for 3 min, 35 cycles of 93 °C for 15 s, 60 °C for 30 s and 68 °C for 9 min, followed by final extension at Data analysis. The best matching alleles were selected according to the alignment statistics (described in section 4.6), and homology to alleles available in the IMGT/HLA 3.30.0_5 database 6,7 . If more than one allele call was available for a specific locus, the ambiguity was resolved by re-analysis of increased number of reads processed from the input files. The remaining ambiguous allele calls (presented in Supplementary Table 1) were referenced against the "Omixon Holotype HLA and Omixon HLA Twin known product limitations" (missing data on SNPs or INDEL variations within the unsequenced 3′ UTR, 5′ UTR and intron 1/exon 1 regions), and were hence reported as ambiguous (i.e. DQB1*06:01:01/15) or up to the third field level only (i.e. DQB1*05:03:01). The Common and Well-Documented (CWD) allele catalogue (version 2.0.), and "Rare Allele Detector" tool (www. allelefrequencies.net), were used for the identification of rare HLA alleles. Nine out of 120 samples were excluded from this study due to Omixon Twin quality control failure. HLA-A, -B, -C, DRB1, -DQA1, -and -DQB1 loci were successfully sequenced in all remaining samples (n = 111).
Quality control (QC) metrics. The Omixon Twin software combines statistical alignment and de novo assembly algorithms for robust allele calling. The default minimum number of reads required for reliable locus mapping was set at ≥2500 for class I, and ≥5000 for class II loci. A read length of 200 bp or greater was a prerequisite for passing QC criteria, and together with additional quality metrics (read quality, noise ratio, consensus phasing, allele imbalance, crossmapping reads, mismatch count) assured the accuracy and confidence of allele assignments. The minimum exon/intron coverage threshold supporting the consensus sequence at the weakest position was set at ≥30 reads.
Statistics. Allelic frequencies were determined by direct counting. Arlequin version 3.5.2.2 49 was used to calculate expected and observed heterozygosity, exact deviations from Hardy-Weinberg equilibrium (a modified version of the Markov-chain random walk algorithm described by Guo and Thomson, 10 6 steps in Markov chain, 10 5 dememorization steps) 50 , and maximum-likelihood haplotype frequencies (an iterative Expectation-Maximization algorithm, convergence criterion ε = 10 −7 , maximum number of iterations = 1000, 50 random initial conditions) 51 . A series of linkage disequilibrium (LD) measures (D' 52,53 , Wn 54 ) was provided for each pair of loci by using the Pypop 0.7.0 software 55 . The empirical P-values were obtained by permutation testing (1000 randomizations).