Introduction

Immunoglobulins (IGs) are critical protein components of the immune system with roles in both innate and adaptive immune responses [1]. IGs are produced by B lymphocytes and are either expressed on the cell membrane as B cell receptors (BCRs) or secreted as antibodies. IGs are composed of two pairs of identical ā€˜heavyā€™ chains and ā€˜lightā€™ kappa or lambda chains, encoded by genes located at three loci in the human genome: the IG heavy chain locus (IGH) at 14q32.33, and the IG lambda (IGL) and kappa (IGK) loci, located at 22q11.2 and 2p11.2, respectively [2]. Through the unique mechanism of V(D)J recombination [3], individual variable (V), diversity (D) and joining (J) genes at the IGH locus, and V and J genes at either the IGK or IGL loci, somatically rearrange to generate V-D-J and V-J regions that encode respective antibody variable domains. IG loci are enriched for large structural variants (SVs), including insertions, deletions, and duplications of functional genes, many of which are variable between human populations [4,5,6,7]. The IG loci exhibit extensive haplotype diversity and structural complexity, which contributes to challenges associated with IG haplotype characterization using standard high-throughput approaches [4, 6,7,8,9]. This has limited our understanding of the impact of IG genetic variation in immune function and disease [10,11,12].

A unique structural characteristic of the IGK locus is that it includes two large inverted segmental duplications (SDs) that contain distinct V genes; these paralogous regions (termed proximal and distal) [13, 14] are separated by an array of 45-Kbp highly identical repeats that encode the 45ā€‰S rRNA, totaling 1067285 bp [15]. The proximal region spans 500 Kbp and includes 69 IGKV functional/ORF genes and pseudogenes, downstream of which are 5 functional IGKJ genes and a single functional IGK constant gene. The distal region spans 430ā€‰Kbp and includes 62ā€‰V genes and pseudogenes (distal V genes are denoted by a ā€˜Dā€™; for example, IGKV1D-12) [2]. An additional unmapped IGKV gene has also been reported [16] and is cataloged in the International IMmunoGeneTics Information System (IMGT) [17] as IGKV1-NL1 (where the NL indicates ā€œnot localizedā€). The documentation of this gene indicates the presence of SVs in IGK, consistent with observations in IGH and IGL [4, 6, 7, 14]. Evidence for sequence conversion between proximal and distal regions has also been reported [14], highlighting the potential for complex genetic signatures to be observed at the population level.

To date, there are 108 functional/ORF IGKV, IGKJ, and IGKC alleles curated in IMGT. However, only a limited number (nā€‰=ā€‰4) of fully curated and annotated haplotypes have been characterized at the genomic level [13, 14, 18], and adaptive immune receptor repertoire sequencing (AIRR-seq) has indicated that allelic variation in IGK is likely to be more extensive than what is currently documented [19, 20]. Compared to IGH [4, 6] and IGL [7], our understanding of germline variation at the IGK locus is limited, especially in terms of haplotype structure (including structural variation) as well as coding and non-coding polymorphism. It will be critical to further catalog this genetic variation if we are to clarify the impact of IG germline variants on the antibody response. Diversity in antigen-naĆÆve B cell antibody repertoires is due to combinatorial and junctional diversity and the effective pairing of particular heavy and light chain genes [1]. This partly ensures that the immune system is able to recognize and mount effective immune responses against a broad range of potential antigens. Critically, however, population-level IG haplotype and allelic variation also makes important contributions to antibody diversity and function [6, 11, 21, 22], including in the context of infection and vaccine responsiveness. For example, there have now been clear demonstrations that IGH locus SVs and single nucleotide variants (SNVs) impact usage of IGH genes in expressed antibody repertoires [6]. Antibody light chains also contribute to antigen binding and can limit self-reactivity through the process of light chain receptor editing during B cell development (reviewed in [23]). Consistent with findings related to IGH diversity, usage of specific IGKV genes has been associated with neutralizing antibodies against protein components of influenza A (IGKV4-1, IGKV3-11, IGKV3-15; [24, 25]), HIV-1 (IGKV3-20, IGKV1-33; [26]), Zika virus (IGKV1-5; [27]), as well as auto-reactivity in celiac disease (IGKV1-5; [28]) and systemic lupus erythematosus (IGKV4-1; [29]). Taken together, these findings further motivate a need to develop more robust methods for detecting and genotyping IGK polymorphisms.

To characterize germline IGK sequence variation at population-scale, we extended a technique that employs targeted long-read sequencing to generate haplotype-resolved assemblies of IGK proximal and distal regions [6,7,8]. Here, we characterize IGKC, IGKJ, and IGKV alleles in 36 individuals from the 1000 genomes project (1KGP) [30]. We report 67 novel IGKV alleles and provide evidence of inter-population variation in IGKV allele frequencies. We identify a SV that includes a previously unlocalized IGKV gene (IGKV1-NL1) and provide evidence that this SV is common across populations. In addition, we report a novel large inversion that involves both proximal and distal region genes, and validate a previously described gene conversion event [14], which we also show is associated with haplotype structure in the IGK distal region.

Material and methods

Sample information

Samples used in this study were previously collected as part of the 1KGP [30]. Genomic DNA isolated from lymphoblastoid cell lines (LCLs) were procured for all individuals from the Coriell Institute for Medical Research (https://www.coriell.org/; Camden, NJ). Sample population, subpopulation, and sex are reported in Table S1.

Construction of a custom reference assembly for the IGK locus

The GRCh38 assembly gap between the IGK proximal and distal regions was modified as follows: 10 Kbp of sequence in the GRCh38 assembly upstream of IGKV2-40 and downstream of IGKV2D-40 was replaced with homologous sequence in the T2T assembly [15]; the remaining 190,249 bases in the GRCh38 assembly gap were N-masked. 24,729ā€‰bp in a hifiasm-generated assembly for the sample HG02433 were inserted between IGKV1D-8 and IGKV3D-7; 10 Kbp of sequence flanking the insertion in this assembly was swapped with homologous sequence in GRCh38. An example of a haplotype harboring this insertion mapped to GRCh38 is shown in Fig. S1.

Coordinates of IGK gene features (L-Part1, Intron, V-exon, RSS) were determined using the program ā€˜Diggerā€™ (https://williamdlees.github.io/digger/_build/html/index.html) and will be made available, along with the custom reference, at https://vdjbase.org/.

IGK sequencing and phased assembly

To enrich for DNA from the IGK locus, we designed a custom Roche HyperCap DNA probe panel (Roche) that includes target sequences from the GRCh38 IGK proximal (chr2:88,758,000-89336429) and distal (chr2:89846189-90,360,400) regions. Genomic DNA samples were prepared and sequenced following previously described methods [6,7,8]. Briefly, genomic DNA was sheared to ~16 Kbp using g-tubes (Covaris, Woburn, MA, USA) and size-selected using the Blue Pippin system (Sage Science, Beverly, MA, USA) using the "high pass" protocol to select fragments greater than 5 Kbp. Size-selected DNA was ligated to universal barcoded adapters and amplified, then individual samples were pooled in groups of six prior to IGK enrichment using the custom Roche HyperCap DNA probes. Targeted fragments were amplified after capture to increase total mass, followed by SMRTBell Express Template Prep (Kit 2.0, Pacific Biosciences) and SMRTBell Enzyme Cleanup (Kit 2.0, Pacific Biosciences), according to the manufacturerā€™s protocol. Resulting SMRTbell libraries were pooled in 12-plexes and sequenced using one SMRT cell 8ā€‰M on the Sequel IIe system (Pacific Biosciences) using 2.0 chemistry and 30ā€‰hour movies. Following data generation, circular consensus sequencing analyses were performed, generating high fidelity (ā€œHiFiā€, >Q20 or 99.9%) reads for downstream analyses. Reads aligned to our custom reference were input to bamtocov [31] for coverage analysis (Fig. S2).

HiFi reads were processed using IGenotyper [8]; the programs ā€œphaseā€ and ā€œassembleā€ were run with default parameters to generate phased contigs and Hifi read alignments to the custom IGK reference assembly. HiFi reads were also used to generate haplotype-phased de novo (i.e. reference-agnostic) assemblies using Hifiasm [32] with default parameters. For each sample, Hifiasm assembly contigs were concatenated into a single FASTA file, then redundant contigs (i.e. a contig with 100% sequence identity to a second contig) were filtered out to retain only unique contigs using the program ā€˜seqkit rmdup --by-seq <hifiasm_contigs.fastaā€‰>ā€‰ā€™ (seqkit v2.4.0) [33]. Hifiasm assembly contigs were mapped to the custom IGK reference using minimap2 (v2.26) with the ā€˜-x asm20ā€™ option. For each sample, aligned HiFi reads as well as aligned IGenotyper-generated contigs and Hifiasm-generated contigs were viewed in the Integrative Genomics Viewer (IGV) application [34] for manual selection of phased contigs. Where a Hifiasm and an IGenotyper contig were identical throughout a phased block, the Hifiasm contig was selected (example: Fig. S3). Examples of contig selection are provided in Figs. S3-S7. Contigs were evaluated for read support from mapped HiFi reads (example shown in Fig. S4), and contigs harboring one or more SNVs that lacked read support were not selected during manual curation.

Curated, phased assemblies were aligned to the custom IGK reference using minimap2 [35] with the ā€˜-x asm20ā€™ option. Phased assembly coverage across IGK (Fig. S8) was computed using ā€˜bedtools coverageā€™ (bedtools v2.30.0) [36]. Average IGenotyper and Hifiasm contig lengths across IGK (Fig. S2) were computed by extracting contig lengths using ā€˜samtools viewā€™.

To assess accuracy of manually curated assemblies, IGK-personalized references were first generated by N-masking the IGK locus of our custom reference (chr2:88837161-90280099) and, for each sample, appending the reference FASTA file with IGK curated assemblies (i.e. all IGK contigs). HiFi reads from each individual were aligned to the corresponding IGK-personalized reference using minimap2 with the ā€˜-x map-hifiā€™ preset. Positions in IGK assemblies with > 25% of aligned HiFi reads mismatching the assembly were identified by parsing the output of samtools ā€˜mpileupā€™ using a custom bash script; assembly accuracy was determined using the formula [total bases without a mismatch / total (diploid) assembly length (bp)] * 100 = % accuracy (Table S2).

Genotyping and analysis of IGK assemblies

Sample IDs were inserted into the @RG tag of curated assembly BAM files using the ā€˜samtools addreplacergā€™ command (samtools v1.9) [37]. Variants were called from assemblies using the ā€˜bcftools mpileupā€™ (bcftools v1.15.1) command with options ā€˜-f -B -a QSā€™ and a ā€˜--regionsā€™ (BED) file corresponding to the IGK locus (proximal: chr2:88866214-89331428, distal: chr2:89851190-90265885), then the ā€˜bcftools callā€™ command with the ā€˜-mā€™ option. Multiallelic SNVs were split into biallelic SNVs using the ā€˜bcftools normā€™ command with options ā€˜-a -m-ā€™. A mutli-sample VCF file was generated using ā€˜bcftools mergeā€™ with the ā€˜-m bothā€™ option, and the INFO field annotations for V-exon, introns, L-Part1, RSS, and intergenic sequences were added using vcfanno [38]. The VCF file was filtered to include SNVs with an alternate allele count > 1 (MAFā€‰ā‰„ā€‰2.86%) using ā€˜bcftools viewā€™ with the ā€˜-v snps -i ā€˜INFO/ACā€‰>ā€‰1.0ā€™ optionsā€™.

The filtered VCF file was input for principal component analysis (PCA) of SNVs in the IGK distal region (chr2:89852177-90266726) using the SNPRelate package [39] command ā€˜snpgdsVCF2GDSā€™ with the option ā€˜method = ā€œbiallelic.onlyā€ā€™ followed by the ā€˜snpgdsPCAā€™ command with default parameters. The SNPRelate command ā€˜snpgdsIBSā€™ was used to calculate a pairwise dissimilarity matrix from distal region SNVs, followed by hierarchical clustering using the ā€˜snpgdsHClusterā€™ function. The outputs of these functions were input to the functions ā€˜snpgdsCutTreeā€™ and ā€˜snpgdsDrawTreeā€™ to generate a dendrogram.

A parsable genotype table was generated from the filtered VCF using ā€˜bcftools queryā€™ and read into R for analysis. SNV densities over genomic intervals were calculated using the formula: SNV density = [sum of alternate alleles over interval / interval length (bp) * 2]. 10 Kbp intervals along the IGK distal region were generated using the ā€˜bedtools makewindowsā€™ [36] command. dbSNP data were downloaded from the UCSC Table Browser [40] (dbSNP ā€˜common variantsā€™ release 153). The 1KGP ā€œphase 3ā€ GRCh38 chromosome 2 multi-sample VCF file (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/phase3_liftover_nygc_dir/phase3.chr2.GRCh38.GT.crossmap.vcf.gz) was filtered to include only the IGK locus (chr2:88866214-90265885) excluding the intervening gap region (chr2:89331429-89851189), split into single-sample files, filtered to include only SNVs using bcftools ā€˜viewā€™, and compared with SNVs from curated assemblies using bedtools ā€˜intersectā€™. The downloaded multi-sample VCF file included 33 of 36 samples used in this study. Both dbSNP and 1KGP coordinates were lifted over to our custom reference to account for the SV insertion.

Identification of IGKC, IGKJ, and IGKV alleles

Allele sequences at V-exons, J-exons, and C-exons were extracted from assembly BAM files and queried against all human immunoglobulin alleles in the IMGT database (downloaded 2023-09-26) using custom python scripts and tools available in the receptor_utils library (https://pypi.org/project/receptor-utils/). Custom python and bash scripts were used to generate metrics of HiFi read support for novel alleles. Briefly, for each sample, IGK assemblies were appended to our custom reference with the IGK locus N-masked to generate an IGK-personalized reference FASTA. All HiFi reads from IG-capture were mapped to the IGK-personalized reference using minimap2 with the option ā€˜-ax map-hifiā€™, then the resulting BAM file was input to samtools ā€˜mpileupā€™ with iteration over allele coordinates in the IGK-personalized reference. The outputs of this script are in columns 17-26 of Table S3 and include a column for total number of HiFi reads spanning the novel allele (ā€˜Fully_Spanning_Readsā€™) and a column for the number of HiFi reads spanning the novel allele with 100% sequence identity (ā€˜Fully_Spanning_Reads_100%_Matchā€™). A complete description of HiFi read support metrics for alleles is available at https://vdjbase.org/. Novel alleles were analyzed using IMGT/V-QUEST [41] to determine the presence of synonymous, non-synonymous, and frame-shift substitutions.

Statistical analyses

Statistical tests described in Results and Figure Legends were performed in R (v4.2.1).

Results

Assembly of the IG kappa chain locus

To sequence the IGK locus, we extended our previously described method [6,7,8] such that genomic DNA fragments derived from this locus were enriched using targeted probes and sequenced using single-molecule, real-time (SMRT) long-read sequencing. This methodology was applied to an ancestrally diverse cohort (nā€‰=ā€‰36; Table S1) to assess genetic variants in IGK across different populations. The cohort represented individuals of African (AFR, nā€‰=ā€‰8), East Asian (EAS, nā€‰=ā€‰7), South Asian (SAS, nā€‰=ā€‰10), European (EUR, nā€‰=ā€‰10), and Central American (AMR, nā€‰=ā€‰1) ancestry, spanning 18 subpopulations. Average high quality (>99% accuracy, ā€œHiFiā€) read coverage across the custom IGK reference (see Materials and Methods) was > 50X within the proximal and distal IGK regions for all samples (Fig. S2). HiFi reads were assembled into phased contiguous sequences (contigs) using both IGenotyper [6, 8], which uses reference-aligned reads to generate contigs, and Hifiasm [32], which generates a de novo diploid assembly (Fig. S1). The resulting phased contigs were assessed for read support and merged (see Materials and Methods) into manually-curated diploid assemblies (see Materials and Methods and Figs. S3ā€“S7), which on average spanned 99.3% (479.53 Kbp) and 99.7% (413.26 Kbp) of the proximal and distal regions, respectively (Fig. S8). Accuracies of manually-curated diploid assemblies were determined by alignment of HiFi reads to IGK-personalized references (see Materials and Methods) and ranged from 99.84% to 99.97% (mean: 99.95%) (Table S2). Assembly contigs could not be phased between proximal and distal regions due to the extended non-IGK sequence region not covered by our probe panel; additionally, we cannot fully account for the possibility of rare phase-switch errors present in hifiasm assemblies. However, all assemblies were locally phased over each variant position, allowing us to effectively genotype IGK gene loci and locus-wide polymorphisms. For IGK gene and allele annotation, as well as variant calling, curated assemblies were aligned to our in-house reference assembly, which included a novel SV insertion identified in this cohort (see also Materials and Methods).

Characterization of IGKV alleles

Allele sequences were annotated and extracted from haplotype-resolved sequences of all donors for 47 functional/ORF IGKV genes. These alleles were compared against those cataloged in IMGT [17], revealing both known (nā€‰=ā€‰78) and novel (nā€‰=ā€‰67) alleles (Fig. 1A, Tables S3 and S4). Of the 67 novel alleles, 23 were observed in >1 individual; 9 of the 67 alleles were also independently supported by alleles curated in VDJbase [42] (Table S3). Novel alleles were assessed for HiFi read support quantitatively, revealing a minimum of 5 HiFi reads supporting each allele with 100% sequence identity (mean: 46.7 supporting reads, range: 5-181 supporting reads; Table S3), and qualitatively by visualization in IGV of HiFi reads mapped to alleles in curated assemblies (Fig. S9).

Fig. 1: Characterization of known and novel IGK alleles in a cohort of 36 individuals.
figure 1

A Diagram of alleles for the single IGKC gene, 5 IGKJ genes, and 47 IGKV genes. Individuals (rows) are split into two sub-rows, each indicating the allele for a gene (column). Tile colors correspond to allele sequences cataloged in the IMGT database, which have numerical designations. Allele sequences absent from IMGT are considered novel and colored lime-green. Gray tiles represent allele absence due to structural variation (deletion); white tiles represent regions unresolveable due to V(D)J recombination artifacts (see Fig. S8), which we have shown previously to occur in DNA derived from some LCLs [7, 8]. Locally phased assemblies were created for the proximal and distal regions separately; individual sub-rows do not represent locus-wide haplotypes. B Bar plot of the percent of individuals in each population that are heterozygous for 47 IGKV genes. C Stacked bar plot of the number of known and novel alleles for each of 47 IGKV genes; at least one novel allele was identified for 33 of these genes. D Bar plot of the number of individuals that have at least one allele for each of 67 novel alleles. Each novel allele is colored to indicate the presence of a) only synonymous, b) at least one non-synonymous, or c) a frame-shift substitution. E Bar plot of the average number of novel alleles identified per individual for each population. F Venn diagram illustrating the distribution of 67 novel alleles among populations. G Stacked bar plot of the number of individuals (colored by population) with at least one novel allele for each of 33 IGKV genes.

Genes with a single identified allele were determined to be homozygous across all individuals; these genes included IGKV3-11, IGKV1-37, IGKV1D-43, and IGKV3D-7 (Fig. 1A, 1B). IGKV genes that were heterozygous in at least half of all individuals included IGKV5-2, IGKV1-5, IGKV1-8, IGKV1-13, IGKV1-17, IGKV3D-15 and IGKV3D-11 (Fig. 1B). Heterozygosity varied among populations for several genes, including IGKV3-15, IGKV1-27, IGKV2-29, IGKV2-30, IGKV1-39, IGKV2-40, and IGKV2D-40 (Fig. 1B).

Novel alleles were identified for 17 of 24 proximal genes and 18 of 23 distal genes (Fig. 1C) (Table S3). Among the 23 novel alleles identified in >1 individual, 10, 12, and 1 resulted in synonymous, non-synonymous, and frame-shift substitutions, respectively (Fig. 1D). Sixteen of the 67 novel alleles differed from the closest-matching IMGT allele by at least 2 nucleotides (range: 1-8 nucleotides) (Fig. S10). Thirty-three individuals (91.7%) carried at least one novel allele, with individuals of African descent harboring ~10 novel alleles per individual on average (range: 8-14) and individuals from other populations harboring ~2-3 novel alleles on average (ranges; EAS: 1-6, SAS: 0-8, EUR: 0-8) (Fig. 1E). Fifty-three (79%) of the novel alleles were exclusive to a single population, of which, 33 were identified only in AFR individuals (Fig. 1F). Fourteen of the novel alleles were identified in two or more populations (Fig. 1F).

Among all functional/ORF IGKV genes, the number of individuals harboring at least one novel allele was highest for IGKV1-13 (24 individuals), IGKV2D-26 (18 individuals), IGKV2D-40, IGKV2D-28 and IGKV3D-15 (11 individuals each), followed by IGKV1-6, IGKV2-40, IGKV1-27, and IGKV1D-42 (ā€‰>ā€‰3 individuals each) (Fig. 1G).

Single nucleotide polymorphism in the IGKV proximal and distal loci

In addition to variants within IGKV V-exon regions, we also characterized SNVs across the whole of the IGK locus. We enumerated 3,396 SNVs for which we observed >1 individual carrying an alternate allele relative to our reference (equivalent to a MAFā€‰ā‰„ā€‰2.86%). Of these, 42% and 57.8% were absent from the dbSNP ā€˜common variantsā€™ dataset in the proximal and distal regions, respectively (Fig. 2A). Sample-level comparison of SNV positions identified in this study with those from short read-derived (1KGP) data revealed that, on average, 58% of SNV positions were uniquely identified in our curated assemblies and 37.9% were identified in both datasets (Fig. 2B, Table S5), consistent with known limitations of short reads for genotyping genomic regions enriched with duplicated sequence [8, 10, 43,44,45]. SNV positions were more numerous in individuals of non-EUR ancestry, most notably among individuals of AFR ancestry (Fig. 2B).

Fig. 2: Single nucleotide variation in IGK proximal and distal regions.
figure 2

A Stacked bar plot of the number of SNVs (MAFā€‰ā‰„ā€‰2.86%) in the IGK proximal and distal regions. Bar colors indicate SNV inclusion in dbSNP. B Stacked bar plot of IGK locus SNV positions found uniquely in the 1KGP ā€œphase 3ā€ release, uniquely in curated assemblies generated in this study (long read-based assemblies), or in both, for 33 individuals. Total SNV positions are indicated above bars. C Pie charts of SNV distribution among genic (functional IGKV genes) and intergenic regions in proximal and distal regions of IGK. D Boxplots of SNV densities in IGK proximal and distal regions for each individual (data point). Gray dashed lines indicate paired SNV density values for individuals. Gray shading of data points indicates data point overlap. P-values: paired Wilcoxon tests.

We categorized the 3,396 SNVs as overlapping either intergenic sequence or an IGKV gene feature, including the leader part 1 (L-Part1), intron, V-exon, or recombination signal sequence (RSS). While the majority of SNVs were intergenic (Fig. 2C), SNV density was greater in V-exons, L-Part1 regions, and introns relative to intergenic and overall SNV densities (Fig. S11). Partitioning of SNVs into proximal and distal also revealed that, for each feature except V-exons, there was increased SNV density in the distal relative to proximal region for the majority of (but not all) individuals (Fig. 2D).

Characterization of structural variants, a gene conversion haplotype and associated SNV signatures

We analyzed our curated assemblies for evidence of large structural events. This revealed the presence of aā€‰~ā€‰24.7 Kbp insertion between IGKV1D-8 and IGKV3D-7 in a large fraction of haplotypes in our cohort (Fig. S1); this insertion is not present in GRCh38. The insertion sequence was queried against alleles from the IMGT database using BLAT [46], revealing the presence of a single functional IGKV gene, IGKV1-NL1 (IGKV1-ā€œnot localized 1ā€); this sequence was integrated into the reference assembly used for all analyses in this study (see Materials and Methods) and its absence was codified as a deletion (Fig. 3A, Table S1). The frequency of the deletion was 50% in AFR as compared to 78.6%, 90%, and 85% in EAS, SAS, and EUR populations, respectively (Fig. 3B). The distribution of this SV allele in AFR versus non-AFR groups was significantly different (Fisherā€™s exact test, Pā€‰<ā€‰0.01) (Table S6). Among 35 individuals, one EAS and two AFR samples were identified as homozygous for the IGKV1-NL1 allele (Fig. 1A).

Fig. 3: Genetic features of structural variation and gene conversion in IGKV.
figure 3

A Diagram of structural polymorphisms in the IGK locus, including a gene conversion event in which ~16 Kbp of sequence containing IGKV1-12 and IGKV1-13 replaced paralogous sequence in the distal region, described previously [14]. Also shown is an inversion spanning the proximal and distal regions from IGKV1-27 to IGKV1D-27 with a length of ~ 1.301 Mbp (see Fig. S12), and a 24.7 Kbp deletion SV that includes the gene IGKV1-NL1. B Stacked bar plot indicating the SV genotype frequencies for each population, with ā€œ1ā€ corresponding to deletion. C Stacked bar plot indicating the gene conversion haplotype frequency for each population, with ā€œ1ā€ corresponding to the conversion haplotype. D Boxplot of the SNV density difference between distal and proximal regions with samples grouped according to their genotype for the gene conversion (left) as well as population (right). SNV densities were computed for 10 Kbp windows along the proximal and distal regions from alignments of IGK proximal assemblies to the proximal region of our custom reference and, likewise, alignments of IGK distal assemblies to the distal region of our custom reference (see Materials and Methods). For each sample, the mean SNV density for the distal region was subtracted from the mean SNV density value for the proximal region. E (Top) SNV densities in 10 Kbp windows along the IGK distal region (chr2:89859172-90266726). IGKV genes in this interval are labeled along the x-axis. Each line represents a sample. Samples are grouped according to their genotype for the gene conversion. (Bottom) Genotypes (colors) at each variant position for the corresponding coordinates above. The ā€œ.ā€ symbol represents deletion. Hierarchical clustering of samples (rows) resulted in the 3 indicated clusters. F Principal component analysis of samples based on SNVs (MAFā€‰ā‰„ā€‰2.86%) in the IGK distal region. Samples are colored by population (left) or gene conversion genotype (right). EV1: eigenvector 1, EV2: eigenvector 2.

During assembly curation, we noted an unusual pattern of SNVs in one sample (HG02433, AFR) over the genes IGKV1-27 and IGKV1D-27 wherein half of each read was enriched with mismatches (Fig. S12). The portion of each read that was enriched with mismatches mapped to the paralogous region when aligned by BLAT against GRCh38, but in the opposite orientation, suggesting that such reads span the breakpoints of an inversion involving the proximal and distal regions (Figs. 3A, S12). Hifiasm generated contigs in agreement with these reads (Fig. S12). Analysis of additional haplotypes will be needed to determine the frequency and population distribution of this putative SV.

Watson et al. [14] previously reported a gene conversion event wherein aā€‰~ā€‰16 Kbp region that includes IGKV1-13 was swapped into the paralogous distal region (Fig. 3A). Relative to a reference sequence that does not contain this sequence exchange, a haplotype with the gene conversion is marked by high SNV density within aā€‰~ā€‰16 Kbp window that spans IGKV1D-12 and IGKV1D-13 (Fig. S5). Genotyping of this gene conversion event revealed that it is a common haplotype structure in the population, with a frequency ā‰„ 70% in EUR, SAS, and EAS populations, and a frequency of 37.5% in AFR populations (Fig. 3C, Table S1). The distribution of the gene conversion haplotype in AFR versus non-AFR groups was significantly different (Fisherā€™s exact test, Pā€‰<ā€‰0.01) (Table S6). In this cohort, the four individuals homozygous for the reference (non-conversion) haplotype (HG01887; AFR, HG02107; AFR, HG02284; AFR, HG00136; EUR) were also the only four individuals homozygous for IGKV1D-13*02, IGKV1D-12*02, and IGKV3D-11*02 (Fig. 1A), which may indicate that the gene conversion haplotype structure extends beyond the previously identified ~16 Kbp region.

Based on the observation that most, but not all, individuals exhibit increased SNV density in the distal region relative to proximal region (Fig. 2D), we computed the within-individual difference between distal and proximal SNV densities and then partitioned individuals by genotype for the gene conversion event (Fig. 3D). This analysis revealed that the gene conversion haplotype was associated with increased SNV density in the distal region (Fig. 3D). Accordingly, SNV density in the region spanning IGKV1D-12 and IGKV1D-13 increases with dosage of the gene conversion haplotype (Fig. 3E). We noted increased SNV density throughout an ~70 Kbp region downstream of IGKV1D-13 extending to IGKV1D-17, and also throughout an ~30 Kbp region from IGKV3D-11 to downstream of IGKV1D-43 (Fig. 3E), further suggesting that the gene conversion haplotype structure is extended beyond the ~16 Kbp region containing IGKV1D-12 and IGKV1D-13. Hierarchical clustering (Fig. 3E, Fig. S13) and PCA (Fig. 3F) based on these distal region SNVs revealed separation of individuals based on their genotype for the gene conversion haplotype. In this cohort, we observed differences between AFR and non-AFR individuals with respect to the combinations of the gene conversion genotype and SV genotype; AFR individuals were neither i) homozygous for the gene conversion and homozygous for the SV deletion, or ii) heterozygous for the gene conversion and homozygous for the SV deletion, whereas 19 of 27 (ā€‰~ā€‰70%) individuals of non-AFR ancestry possessed these combinations of genotypes (Fig. S14).

To determine associations between the gene conversion and IGKV alleles, we tested for IGKV allele frequency distribution differences between individuals of each gene conversion genotype group. This analysis revealed significant associations for 13 IGKV genes (Fig. S15; chi-square Pā€‰<ā€‰0.05). The strongest associations were for genes within the gene conversion region (IGKV1D-13, IGKV1D-12), three genes within ~50 Kbp of this region (IGKV3D-15, IGKV1D-42, and IGKV1D-8), and three genes ~160-200 Kbp away from this region (IGKV2D-29, IGKV2D-28, IGKV2D-26). Homozygotes for the gene conversion harbored only the *01 allele for IGKV1D-13 and IGKV1D-12, whereas homozygotes for the non-conversion haplotype harbored only *02 alleles for these genes (Fig. S15). These data suggest that the gene conversion haplotype associates with variation in IGKV allele frequencies in the distal region.

Inter-population IGKV allelic variation

Coding sequences in the human IGH and IGL loci have been shown to exhibit population-level allelic diversity [6, 7, 11, 23]. Although our cohort is relatively small in number, we observed that functional/ORF IGKV genes also exhibited inter-population differences in allele frequency (Fig. 4A, Table S8). In total, 10 of the 47 IGKV genes demonstrated significant skewing in allele frequencies (Fig. 4B; chi-square, FDRā€‰<ā€‰0.05). For example, the alleles IGKV1-5*04 and IGKV1-8*02 were not identified in AFR or SAS individuals, but each had an allele frequency of 0.43 and 0.1 in EAS and EUR individuals, respectively (Fig. 4A). The alleles IGKV6-21*02, IGKV2-24*02, and IGKV1-27*02 were similarly enriched in EAS individuals (Fig. 4A). These inter-population differences in allele frequency appeared to correlate with overall SNV density as six of seven EAS individuals showed SNV density patterns between IGKV6-21 and IGKV1-27 that were highly similar and, on average, distinct from AFR, EUR, and SAS populations (Fig. S16). Seven alleles were identified only in AFR individuals and also had an intra-population allele frequency of 0.25 or greater, including IGKV2-29*03, IGKV1-39*02, IGKV3-15*02, and the novel alleles IGKV2D-28_N1, IGKV1-27_N1, IGKV1D-42_N1, and IGKV2D-40_N1 (Fig. 4A, Table S7). We noted that SNV density was uniquely elevated in 75% (6/8) of AFR individuals in a region including IGKV1-39 and IGKV2-40 (Fig. S16), suggesting a relationship between haplotype structure and gene alleles in this region. The frequencies of the novel alleles IGKV2D-26_N1 and IGKV2D-29*02 were > 4-fold higher in AFR relative to each other population (Fig. 4Aā€“B). In contrast, the frequency of IGKV2D-29*01 was 0.19 in AFR and ~0.9 in EUR, SAS, and EAS populations. As noted above (Fig. 1A, Fig. S15), the *02 alleles for IGKV1D-13, IGKV1D-12, and IGKV3D-11 were homozygous in three AFR individuals and one EUR individual; there was a trend of increased frequency of these *02 alleles in AFR individuals (Fig. 4A).

Fig. 4: Inter-population IGKV allelic variation.
figure 4

A Stacked bar plot of allele frequencies for each of 47 functional IGKV genes for AFR, EAS, SAS, and EUR populations. For each gene, a color corresponds to a unique allele (detailed in Table S7). Gray color indicates allele absence due to structural variation (deletion). B Differences in allele frequency distributions among populations were determined using a chi-square test (Table S8); Benjamini-Hochberg adjusted (False Discovery Rate) P-values (Padj) are plotted as -log10(Padj) for each gene. Red points indicate Padj < 0.05.

Among non-AFR populations, only two genes showed significant inter-population variation in allele frequency, with enrichment of IGKV6-21*02 and IGKV1-27*02 in EAS relative to EUR and SAS (Figs. S17, 4A), suggesting that AFR haplotypes significantly contribute to observed variation in inter-population allele frequency distributions among IGKV genes.

Discussion

Genomic loci enriched with SDs, including the IGK locus [14, 23], lack complete and accurate information on population-level variation due to shortcomings of short-read sequencing [10, 47, 48]. Here, we used a SMRT long-read sequencing and analysis framework to characterize diploid IGK assemblies from 36 individuals of African, East Asian, South Asian, European, and Central American ancestry. These assemblies represent a significant advance, increasing the number of available curated assemblies 18-fold, from 4 [13,14,15, 18] to 72 haplotypes. We identified a common SVā€‰~ā€‰24.7 Kbp in length that includes a functional IGKV gene observed more frequently in AFR individuals relative to three other populations. One AFR haplotype in this cohort had evidence of an inversion spanning ~1.301 Mbp ~, with breakpoint-spanning reads covering IGKV1-27 and IGKV1D-27. Analysis of 47 functional IGKV genes revealed at least one novel allele for 35 of these genes, with 67 novel alleles identified in total. Ten IGKV genes showed evidence of inter-population allelic variation, including alleles unique to specific populations within this dataset. Finally, we observed a common gene conversion event that associates with a distinct IGK distal haplotype structure. This gene conversion haplotype may be a significant driver of genetic variation in the IGK distal region.

Databases of germline IG alleles are foundational for the analysis and interpretation of AIRR-seq data [48, 49] and, more fundamentally, genotyping is required to determine the impact of IG germline variation on IG gene usage in antibody repertoires [6]. To our knowledge, the data presented here represent the first survey of human IGK haplotypes in an ancestrally diverse cohort. Novel IGKV allele sequences described in this study represent aā€‰~ā€‰64% increase in IGKV alleles available in IMGT. These alleles and their associated haplotype data have been made available in VDJbase and the Open Germline Receptor Database (OGRDB) as part of an ongoing effort to build population-representative germline allele databases [42, 50, 51] that can be leveraged for applications such as AIRR-seq analysis [52]. The high prevalence of novel alleles in AFR individuals relative to the other populations in this study reflects the ongoing need to generate genomic resources that are representative of the global population [45].

Deficits in existing resources for other forms of genetic variation were also observed in this dataset. For example, of the >3,000 common SNVs identified, 49.7% were not found in dbSNP. This is likely a reflection of the fact that locus complexity has hindered the use of high-throughput genotyping methods for cataloging IGK variants. We demonstrated that for most individuals in our cohort, over half of the SNVs called using our approach were missing from the 1KGP phase 3 genotype sets. This result alone implies that genetic disease association studies in the IGK locus have likely not effectively surveyed genetic diversity in this region.

Structural variation is common in the IG loci [4, 6, 7, 14]; here, we identified a haplotype harboring an inversion spanning 15 functional/ORF genes across the IGK proximal and distal regions, as well as aā€‰~ā€‰24.7 Kbp insertion in the IGK distal region that includes the functional gene IGKV1-NL1. To our knowledge, this is the first inversion characterized in the IG loci. Higher frequency of the insertion sequence in AFR individuals is consistent with the observation that AFR populations harbor increased amounts of sequence in SD regions genome-wide [53, 54]. Inclusion of this SV in future human genome references will be important for not only describing IGKV1-NL1 alleles, but also for genotyping all SNVs in this SV that can be tested for association with immune-related phenotypes. It is notable that these were the only two SVs identified in this cohort, indicating that SVs may be less characteristic of the IGK locus relative to IGH.

It is documented that SDs provide substrates for more frequent gene conversion, in some cases leading to enrichments of SNVs [47]. We previously reported the existence of two distinct haplotype structures within ~16 Kbp of the IGK distal region; one of these haplotypes harbored a gene conversion involving sequence exchange between IGKV1D-13 and IGKV1-13 [14]. Here, we demonstrated that haplotypes harboring the gene conversion are common in the population and associate with increased SNV density (relative to the remainder of the IGK distal region) throughout ~100 Kbp ranging from IGKV1D-17 to IGKV1D-43. The frequency of both the gene conversion haplotype and the IGKV1-NL1 SV were similar in EAS, SAS, and EUR populations but distinct among AFR individuals. Whereas most individuals from other populations were grouped into clusters by the PCA based on variants in the distal region (Fig. 3F), individuals of AFR ancestry were more dispersed, suggesting the potential for greater haplotype diversity in these populations. However, given the smaller sized cohort analyzed here, analysis of additional haplotypes will be needed to determine the extent of inter-population variation in IGK proximal and distal regions.

Considering that IGH germline variants are associated with usage of IGH genes in expressed antibody repertoires [6], we hypothesize that the usage of IGK genes will also likely be associated with SNVs and SVs identified here. AIRR-seq studies focused on IGL and IGK have shown inter-individual variation in germline gene usage and alternative splicing in expressed light chain repertoires [55]. The data and analytical framework presented here lay a foundation for testing the impact of IGK locus-wide (coding and non-coding) germline variants on variation in the expressed antibody repertoire in diverse human populations. This will expand our understanding of the collective roles of IG germline variation across the IGH, IGL and IGK loci in antibody diversity and function, including potential genetic interactions between these complex gene regions and their coevolution [21, 23]. This will ultimately be critical for identifying coding alleles and haplotype variation as determinants of disease susceptibility and clinical phenotypes, including autoimmunity, cancer, infection, and vaccine efficacy [9, 11, 21, 24, 56,57,58,59,60,61,62,63,64,65,66,67,68].