The Human Leukocyte Antigen Locus and Rheumatic Heart Disease Susceptibility in South Asians and Europeans

Rheumatic heart disease (RHD), an autoinflammatory heart disease, was recently declared a global health priority by the World Health Organization. Here we report a genome-wide association study (GWAS) of RHD susceptibility in 1,163 South Asians (672 cases; 491 controls) recruited in India and Fiji. We analysed directly obtained and imputed genotypes, and followed-up associated loci in 1,459 Europeans (150 cases; 1,309 controls) from the UK Biobank study. We identify a novel susceptibility signal in the class III region of the human leukocyte antigen (HLA) complex in the South Asian dataset that clearly replicates in the Europeans (rs201026476; combined odds ratio 1.81, 95% confidence intervals 1.51–2.18, P = 3.48×10−10). Importantly, this signal remains despite conditioning on the lead class I and class II variants (P = 0.00033). These findings suggest the class III region is a key determinant of RHD susceptibility offering important new insight into pathogenesis while partly explaining the inconsistency of earlier reports.

interpreted with considerable caution, given the variable genotyping approaches, small sample size, limited quality control and confounding due to genetic ancestry 11 . Across the pre-GWAS reports, there are no clear examples of the same HLA allele being associated with susceptibility in two or more studies 11,12 . Additionally, the specific classical alleles that best explained the Australian signal were not reported in any of the candidate gene studies 11 .
In contrast to the Australian study, however, our Oceanian study found negligible signal in the HLA complex 9 , a surprising finding given the putative role for HLA in the disease's pathogenesis 6 . While we cannot be certain, it is possible this result represents a false negative, although it is notable the study was adequately powered to detect the large effect sizes that have been reported previously 11 . We speculate, therefore, that the negative result might be attributable to the substantial genetic heterogeneity within the study population, which could have diluted out a HLA signal, in which the underlying causal variants occurred on distinct background haplotypes in each of the ancestral groups. On balance, while we consider it highly likely that HLA variants contribute to RHD susceptibility, there is a clear need to clarify the causal variants of these association signals.
Here we report a GWAS of susceptibility to RHD limited to the South Asian population, motivated by the substantial burden of RHD within this region and the need to refine the HLA and other genetic signals previously associated with this disease. We identify in the South Asians a novel susceptibility signal in the class III region of the HLA complex that clearly replicates in a follow-up analysis of an independent European dataset derived from the UK Biobank, a resource selected for study on the basis of robust HLA reference data. Importantly, we show the class III signal remains apparent despite conditioning on the lead variants in class I and class II, suggesting that at least one underlying causal variant is situated in the class III region. This finding is significant, not only because of the numerous immunologic genes, including complement components, located in the class III region, but also since it likely goes some way to explaining the inconsistency of earlier reports.

Results
Genome-wide analysis. In total, 854 Northern Indians (510 cases and 344 controls) and 309 Fijian Indians ). The top variant (rs201026476) in this region, with an imputation information metric score of 0.86 for the Fijian Indians and 0.87 for the Northern Indians, had a MAF of 0.15, and each copy of the minor allele was associated with a two-fold increased risk of disease (odds ratio, OR, 1.99, 95% confidence intervals, CI, 1.58-2.51, P = 7.45×10 −9 ). The second and third strongest signals were found in the class I (HLA-B, rs3819306, P = 1.91×10 −7 ) and class II (HLA-DQB1, rs28724238, P = 7.77×10 −7 ) regions, respectively.
To further define this signal, we performed stepwise conditional analyses by adding the dose of each associated allele as a covariate to the model (see Supplementary Fig. S4). After conditioning on the class III signal, the strongest signal (rs3819306) was located in HLA-B (OR 1.39, 95% CI 1.20-1.61, P = 1.83×10 −5 ; see Supplementary  Fig. S4b). However, conditioning on the lead SNPs in HLA-B and HLA-DQB1, the lead SNP in class III remained associated with susceptibility (P = 0.00026) suggesting an independent effect (see Supplementary Fig. S4c). The previously reported rs9272622 10 was not associated with susceptibility (P LMM = 0.28).
To validate our findings, we examined the HLA locus in the European UK Biobank dataset (150 cases of mitral stenosis and 1309 controls; see Supplementary Table S1), combining the resulting association statistics from this analysis with those from the two South Asian populations (Fig. 1). The peak SNP in class III was associated with susceptibility in the UK Biobank data in the same direction (rs201026476, OR 1.54, 95% CI 1.14-2.10, P LMM = 0.0057), with a combined effect size that was consistent with the discovery analysis (OR 1.81, 95% CI 1.51-2.18, P = 3.48×10 −10 ; Fig. 1a). The variant located in intron 4 of HLA-DQB1 (rs28724238, OR 1.75, 95% CI 1.42-2.15, P = 1.73×10 −7 ) also replicated (P LMM = 0.017), as did the HLA-B signal, although in the combined analysis, the signal peaked at a SNP (rs9405084) located 1,286 base pairs upstream of HLA-B (OR 1.36, 95% CI 1.19-1.55, P = 3.39×10 −6 ).
The conditional analyses followed a similar pattern, although after conditioning on the top class III SNP, the strongest signal (rs432375, P = 6.40 × 10 −5 ) was located 2,290 base pairs upstream of HLA-DOA, a HLA class II alpha chain paralogue, rather than at HLA-B (Fig. 1b). However, the class I signal remained apparent, with the lead SNP a coding variant within exon 1 of HLA-B (rs1050462, P = 0.00027). After conditioning on both rs9405084 (class I) and rs28724238 (class II), the class III signal was again maintained (rs201026476, P = 0.00033; Fig. 1c).

HLA imputation analysis.
To further understand the potential functional variants across the HLA region, we imputed classical HLA alleles and amino acid polymorphisms at class I and class II loci. Using the T1DGC reference panel, a reasonably high proportion of variants were accurately imputed based on the R 2 metric (proportion variants with R 2 > 0.80: Fijian Indian, 91.74%; Northern Indian, 91.52%; European UK Biobank, 96.16%). For comparison, when using the Pan-Asian reference panel, imputation accuracy was significantly lower (proportion variants with R 2 > 0.80: Fijian Indian, 73.10%; Northern Indian, 71.10%).
The strongest allelic signal in the class II region in the South Asian analysis mapped to the HLA-DQB1*03:03 allele (OR 1.90, 95% CI 1.  Overall, there was limited signal at the classical alleles and amino acids linked to susceptibility in the Australian study mentioned above, although we did observe an effect at the coding change at position 38 of HLA-DQB1 in the same direction (OR 0.87, 95% CI 0.76-0.99, P = 0.031; see Supplementary Table S3). Interestingly, however, HLA-DQB1*03:03 was the classical allele, with MAF > 0.5%, most associated with a self-reported history of rheumatic fever risk in a study by 23&Me (OR 1.28, 95% CI 1.05-1.55, P = 0.017) 13 , with an effect consistent in size and direction (combined OR 1.45, 95% CI 1.24-1.69, P = 4.05 × 10 −6 ; see Supplementary Fig. S6b).
Finally, as in the SNP-based GWAS, the signal in class II was linked to the class III signal, while the class I signal was independent. After conditioning on the class III SNP, the strongest signal was HLA-B*40:06 (P = 0.0021; Fig. 2b). Conditioning on both the class I and class II SNPs, only a marginal signal remained at HLA-A*02:11 (P = 0.0096; Fig. 2c).

Discussion
In the first genome-wide association study of RHD to be reported outside of the Australia-Pacific region, we have resolved a complex HLA signal into its component parts. We have shown that a single HLA signal overlapping the class III region most likely comprises at least two independent coding or regulatory effects across the class I, II and III loci. While most studies to date have focused on the relationship between classical HLA alleles and susceptibility, our data suggest these signals are in fact more complex and cannot be attributed to the classical alleles alone. Indeed, based on annotations in Ensembl 14 , the effect of Thr185Ile in HLA-DQB1, as an example, is much more likely to be regulatory than coding, not least because it shows a strong negative association with expression of HLA-DQB1 itself 14 . Similarly, the independent lead class III variant (rs201026476), situated in the 3 prime UTR of the PBX2 (Pre-B-cell leukaemia transcription factor 2) gene has regulatory annotations and thus For each locus, the negative common logarithm of the P value from LMM analysis is plotted with two-digit alleles to the left and four-digit alleles to the right defined by HLA imputation using SNP2HLA software with the T1DGC reference panel. www.nature.com/scientificreports www.nature.com/scientificreports/ could impact expression of one or more of the numerous immunologic genes including complement components located in the class III region.
While the role of HLA polymorphism has long been suspected, there remains some doubt about the roles that individual alleles play in disease susceptibility across populations. Importantly, our analysis represents the first time HLA signals for RHD have been demonstrated with consistent direction and effect size in more than one ancestral group. Moreover, the signal at HLA-DQB1*03:03 in the 23&Me study, although based on self-reported rheumatic fever 13 , adds further weight to our findings. That our results differ from those reported in the Australian study 10 is unsurprising, given there are likely to be substantial differences between the HLA loci of South Asians and Aboriginal Australians. Added to this, there were also a number of methodological differences, including the software employed for HLA imputation and linear mixed model analysis, which may exacerbate any disparity. Nonetheless, it is reassuring that both studies observed a signal at the coding change at position 38 of HLA-DQB1, raising the possibility that the two studies are tagging the same underlying causal variants. As noted above, we observed negligible HLA signal in our study set in Oceania including, beyond the Fijian Indian subgroup, the specific variants that associate with susceptibility in this South Asian analysis. Indeed, it may be difficult to fully unravel the contribution of HLA to RHD susceptibility in individuals of Oceanian ancestry until further HLA data are generated from these populations, enabling HLA imputation with a population-specific reference panel. Accordingly, we have begun efforts to develop such a panel by HLA typing a subset of our samples from individuals with Oceanian ancestry. Relating our findings to the HLA signals reported before the GWAS era is more difficult, not least because of the marked inconsistencies and the limitations of the studies themselves in addition to true geographical and ancestral differences. Interestingly, the presence of a signal in the class III region, which could have been differentially tagged in earlier studies, goes some way to explaining the inconsistencies of previously reported HLA associations.
This study has a few limitations. First, in comparison to some contemporary GWAS, our total sample size is relatively modest, and coupled with the small sizes of the individual subgroups (in particular the Fijian Indians), it is likely many variants with smaller effects will go undetected until larger collections are assembled. Nonetheless, our study was well powered to detect the vast majority of large effect variants reported in the candidate gene era 11 . Second, within Fiji, members of the general population were recruited as controls; these individuals did not undergo echocardiograms and therefore it is possible to have included a small number of undiagnosed cases of RHD. However, the prevalence of definite RHD among Fijians of Indian descent has been estimated at 3.6-4.4 cases per 1,000 15,16 such that the impact of misclassification should be minimal. There may also be shortcomings associated with using the UK Biobank study, for there were no echocardiographic diagnoses available. However, specificity is likely to be regained by limiting the analysis to the mitral stenosis subgroup, an approach that is somewhat validated by the consistent replication of the South Asian signals.
Third, the genotyping array, containing ~230,000 variants following QC, was not very dense and contained, in comparison to other genotyping arrays, a limited number of variants within the HLA region. Despite this, overall HLA imputation accuracy was high when using the T1DGC reference panel. Imputation accuracy is highly dependent upon the reference panel used and as such, we have so far deliberately limited these analyses www.nature.com/scientificreports www.nature.com/scientificreports/ to the South Asians and Europeans for whom there are reasonable reference panels available. Fourth, this report is focused on the HLA locus because it was the only region of the genome that reached genome-wide significance in the South Asian analysis. Efforts are underway to combine these and other datasets in a genome-wide meta-analysis, facilitating follow-up of other regions, such as the immunoglobulin heavy chain locus 9 . Finally, at this stage, we cannot resolve the genetic determinants of sub-phenotypes, such as specific valve lesions, disease progression or complications, these are issues which larger-scale collaborative datasets should begin to tackle.
In summary, we report a major susceptibility locus for RHD in the HLA region, likely comprising at least two underlying causal variants, which strongly associates with susceptibility to RHD in South Asians and Europeans. These findings add substantially to the knowledge of the role of HLA polymorphism in susceptibility to this devastating and neglected disease. This not only has important ramifications for understanding the immunogenetic basis of the disease process, but also offers important new insight into pathogenesis.

Methods
Sample collections. For the South Asian analysis, genetic material was obtained with informed consent from cases and controls recruited to two distinct studies. Specifically, we expanded an existing collection in Northern India [17][18][19][20][21] , and we used samples from our existing collection of Pacific Islanders 9 , specifically the Fijians of Indian descent. Cases of RHD were defined on the basis of: a history of valve surgery for RHD, a definite RHD diagnosis by echocardiography, or borderline RHD diagnosis by echocardiography with prior acute rheumatic fever 22 .
In India, adults with incident or prevalent RHD were recruited as cases from a single large referral hospital, the Sanjay Gandhi Postgraduate Institute of Medical Sciences (SGPGIMS), Lucknow, Uttar Pradesh; recruitment was limited to patients with an echocardiographic diagnosis of RHD 22 . Controls were recruited based on normal echocardiograms and the absence of prior family history of rheumatic fever [17][18][19][20][21] . In total, DNA samples were obtained from 543 cases and 397 controls. Ethical approval for use of all samples obtained in India was granted by SGPGIMS, as well as the Oxford University Tropical Research Ethics Committee (OxTREC), and all experiments on these samples were performed in accordance with the relevant guidelines and regulations.
In Fiji, children and adults with incident or prevalent RHD were recruited as cases from either the Colonial War Memorial Hospital in Suva, or the Lautoka General Hospital in Lautoka, while members of the general population were recruited as controls, following the approach of the Wellcome Trust Case Control Consortium 23 . Accounting for approximately one third of the population, Fijians of Indian descent are a South Asian population who first came to Fiji from India in the 1870s under the British indentured labour scheme 24 . In total, DNA samples were obtained from 598 cases and 913 controls; 9 of these, 170 cases and 158 controls were of Fijian Indian ancestry. Ethical approval for use of all samples obtained in Fiji was granted by the Fiji National Health Research Committee and the Fiji National Research Ethics Review Committee, as well as OxTREC, and all experiments on these samples were performed in accordance with the relevant guidelines and regulations.
Array genotyping and quality control. We obtained genetic material by sampling peripheral blood in both Fiji and India. Blood samples collected in India were stored in EDTA and frozen at −20 °C until transport to the laboratory facilities at Babasaheb Bhimrao Ambedkar University, Lucknow. Upon arrival, samples were stored at −80 °C until extraction using standard salting out procedures. Extracted DNA was prepared for analysis at the Wellcome Centre for Human Genetics (UK). The handling of blood samples collected in Fiji has previously been described, although it is noteworthy that a proportion of these samples underwent genome-wide amplification due to low DNA concentration 9 . From both collections, 1,268 DNA samples were genotyped at the Oxford Genomics Centre at ~300,000 variants using the HumanCore-24 BeadChip (Illumina Inc., USA). The resulting data were aligned to the forward strand of the Genome Reference Consortium Human Build 37.
After identifying and removing duplicated variants, the South Asian data was divided into two populations: Fijian Indian (n = 328) and Northern Indian (n = 940). We employed standard approaches to quality control (QC) the genotyping data 25 , with most steps performed using PLINK version 1.9 26 (see Supplementary Fig. S1; Supplementary Fig. S2).
Genome-wide imputation, association testing and meta-analysis. Imputation of genotypes not present on the array or missing was performed using the 1000 Genomes Project phase 3 reference panel 27 . We prephased the variants that had passed QC using SHAPEIT version 2 (r644) 28 before performing genome-wide imputation using IMPUTE2 software 29 , excluding imputed SNPs with an information metric ≤0.4, and a minor allele frequency (MAF) ≤5%.
Genome-wide association analysis for the RHD phenotype was performed using a linear mixed model, as implemented in GCTA 1.24.4, which minimises confounding due to population structure, admixture and cryptic relatedness 30 . Additionally, genotypic sex was coded as a covariate for each population, as was sample type (non-amplified or whole genome amplified) for the Fijian Indians and genotyping batch for the Northern Indians. We assessed confounding using quantile-quantile plots and the test statistic inflation factor (λ), and used the accepted threshold for genome-wide significance (P < 5×10 −8 ) 31 . Having estimated effect sizes by transformation 32 , we combined the resulting association statistics by genome-wide meta-analysis using inverse-variance-weighted fixed effects, as implemented in METASOFT 33 . Regional association plots, based on those drawn by the widely used LocusZoom software 34 , were generated for the data. We collated available data from published GWAS, including the Australian study 10 and a study containing over 200,000 23&Me research participants of European ancestry, of which 1,115 were cases of self-reported rheumatic fever 13 . (2020) 10:9004 | https://doi.org/10.1038/s41598-020-65855-8 www.nature.com/scientificreports www.nature.com/scientificreports/ HLA imputation analysis. HLA imputation was performed using SNP2HLA 35 , a software package that imputes classical HLA alleles and amino acid polymorphisms at class I (HLA-A, -B and -C) and class II (-DPA1, -DPB1, -DQA1, -DQB1 and -DRB1) loci from SNP data using the Type 1 Diabetes Genetics Consortium (T1DGC) reference panel. The T1DGC reference panel contains 5,868 SNPs and 4-digit classical HLA types for the eight loci listed above for 5,225 unrelated individuals of European ancestry. For comparison, HLA imputation was also performed with the Pan-Asian reference panel (n = 530) 36 ; this comprises several underlying datasets with ancestry including: Singapore Chinese 37 ; Chinese, Indian and Malaysian 37 ; and Japanese and Han Chinese from the Phase II HapMap 38 . Association analyses mirrored those for the genotyping data using the imputed dosage data, rather than best-guess genotypes, but excluded alleles or amino acids with imputation accuracy R 2 ≤ 0.3. conditional analysis. To identify secondary association signals, conditional association analyses in the SNP-based GWAS and the HLA region were performed with linear mixed models, as implemented in GCTA 1.24.4, using the same covariates as previously mentioned. Within the genome-wide dataset, we first identified the most strongly associated SNP following meta-analysis and performed stepwise iterative conditional regression, adding the dose of the associated SNP as a covariate to the model, to identify other independent signals. We also identified the most strongly associated HLA class I and class II SNPs within this same dataset and performed iterative conditional regression, adding the dose of each associated SNP as a covariate to the model, to identify additional independent signals. Conditional analyses in the HLA region were also performed by adding the dose of each of the previously mentioned SNPs as covariates to the model to see if there were additional signals attributable to HLA alleles or amino acids at each HLA locus.
Replication analysis. The replication analysis was based on the UK Biobank study, which contains genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom 39 . For the purpose of replication, we used mitral stenosis as a surrogate for RHD. Broadly, with the UK's low prevalence of RHD, most diagnostic codes indicating RHD will represent other forms of valvular heart disease 40 . In contrast, codes indicating mitral stenosis, which is now a rare finding in the UK population 41 , are substantially more likely to indicate underlying RHD 40 , as the majority of mitral stenosis cases have underlying rheumatic aetiology 22,[42][43][44][45] . Cases were therefore defined by self-report of mitral stenosis at enrolment or an International Statistical Classification of Diseases and Health Related Problems 10 th Revision (ICD-10) code for rheumatic mitral stenosis (I05.0, rheumatic mitral stenosis; I05.2, rheumatic mitral stenosis with insufficiency) as a primary or other diagnosis in the hospital episode statistics or on a death certificate. Controls were selected from the remainder of the cohort, matched by age, ethnicity, deprivation index, birth outside the UK and recruitment centre, at a ratio of 1:10, beyond which the performance of linear mixed models deteriorates 46 . In total, we identified 196 cases and 1919 controls, of which 150 cases and 1309 controls were defined as Caucasian (i.e. European) by UK Biobank investigators (see Supplementary Table S1 online). These individuals had previously been genotyped at ~800,000 variants using the UK Biobank Axiom Array (Affymetrix, USA). These data were quality controlled by removal of individuals with missing rate > 2% and variants with missing rate > 1%, MAF < 5% or Hardy-Weinberg equilibrium (HWE) P < 1.0×10 −9 . The remaining preparation of the data, including genome-wide and HLA imputation, and the association analyses, mirrored the process used in the South Asian samples.

Data availability
Genotype and phenotype data underlying the manuscript have been deposited in the European Genome-phenome Archive under accession numbers EGAS00001001881 (Fijian Indian data) and EGAS00001003565 (Northern Indian data). Some restrictions on access and usage apply and use is restricted to research focused on RHD.