Genetic variation in the immunoglobulin heavy chain locus shapes the human antibody repertoire

Variation in the antibody response has been linked to differential outcomes in disease, and suboptimal vaccine and therapeutic responsiveness, the determinants of which have not been fully elucidated. Countering models that presume antibodies are generated largely by stochastic processes, we demonstrate that polymorphisms within the immunoglobulin heavy chain locus (IGH) impact the naive and antigen-experienced antibody repertoire, indicating that genetics predisposes individuals to mount qualitatively and quantitatively different antibody responses. We pair recently developed long-read genomic sequencing methods with antibody repertoire profiling to comprehensively resolve IGH genetic variation, including novel structural variants, single nucleotide variants, and genes and alleles. We show that IGH germline variants determine the presence and frequency of antibody genes in the expressed repertoire, including those enriched in functional elements linked to V(D)J recombination, and overlapping disease-associated variants. These results illuminate the power of leveraging IGH genetics to better understand the regulation, function, and dynamics of the antibody response in disease.


Introduction 39
Antibodies (Abs) are critical to the function of the adaptive immune system, and have evolved to be one 40 of the most diverse protein families in the human body, providing essential protection against foreign 41 pathogens. The circulating Ab repertoire is composed of hundreds of millions of unique Abs 1,2 , the composition 42 of which varies considerably between individuals 1-3 , potentially explaining the varied Ab responses observed in 43 103 individuals, previously generated PBMC derived AIRR-seq data for IgM and IgG was utilized. A 19 standardized workflow was developed to process datasets generated using different protocols and sequencing 20 methods 18 (Methods). After processing, a mean of 9,038 B cell clones per repertoire was identified 21 (Supplementary Figure 2a,b). The frequency of V, D and J genes within B cell clones was calculated (i.e., gene 22 usage after collapsing sequences by clone) for each individual. Together these datasets allowed us to resolve 23 large SVs and other genetic variants, and perform genetic association analysis with variation observed in the 24 expressed Ab repertoire. 25 26 Identification of large breakpoint resolved structural variants 28 A major goal of this study was to generate a high-confidence set of genetic variants and gene alleles in IGH in 29 order to perform downstream genetic Ab repertoire association analysis (Fig. 1a). Previous reports have 30 demonstrated that SVs are common in IGH, resulting in large insertions, deletions, duplications and complex 31 events 25,[27][28][29]45 . The presence of unresolved SVs can impact the accuracy of variant detection and genotyping. 32 Thus, a key first step in the creation of genotype call sets was to breakpoint resolve and genotype SVs, which 33 allowed us to account for SVs in determining homozygous, heterozygous, and hemizygous genotypes across 34 all surveyed variants in the locus. 35 Conditional analysis identifies multiple variants associated with the usage of 81 single genes 82 Previous eQTL studies have demonstrated that multiple independent variants can influence gene expression 48 . 83 Here, we hypothesized that the usage of individual genes could be affected by multiple variants, such as 84 multiple SNVs, or a combination of variant types. To test this, we performed a conditional analysis by running 85 an additional QTL analysis in individuals homozygous for either the reference or alternate allele for the lead 86 guQTL variant of all significantly associated genes. Out of the 59 genes associated with gene usage in the IgM 87 repertoire, 55 genes were tested for additional associations. The 4 genes not tested had fewer than 50 88 individuals with homozygous reference or alternate allele genotypes. From this analysis, we identified an 89 additional variant associated with the usage of 14 genes (Supplementary Table 5). In combination with the 90 initial guQTL defined above, for 12 of these 14 genes, we observed effects of 2 SNVs, and in the remaining, 91 we observed combined effects of an SV and SNV. The mean genomic distance between the lead and 92 secondary guQTL variants was 36.2 Kbp (range = 1.7 -161.4 Kbp). Here, we present IGHV1-2 and IGHV3- 66 93 as examples of genes associated with 2 independent variants. Data for all genes is provided in Supplementary 94 Table 5.  95 For IGHV1-2, the lead guQTL was a SV ~31 Kb away from IGHV1-2 (Fig. 3a), which involved the 96 deletion of IGHV7-4-1. Individuals homozygous for the deletion used IGHV1-2 at a 2.8-fold higher rate than 97 individuals homozygous for the SV insertion allele (Fig. 3a). Conditioning on individuals without the deletion 98 identified 35 SNVs additionally associated with the usage of IGHV1-2 (Fig. 3b). Of these individuals, 99 heterozygotes for the lead conditional guQTL used IGHV1-2 at a level (mean usage = 3.8%) similar to those 00 with a deletion in both haplotypes (mean usage = 4.2%). Sequencing data from heterozygotes at the lead 01 conditional guQTL were inspected manually to confirm that IGHV7-4-1 deletions were not present in these 02

individuals. 03
For IGHV3-66, the lead guQTL was a SNV. Individuals homozygous for the reference and alternate 04 allele had a mean usage of 0.19% and 2.14%, respectively. By conditioning on this variant, considering only 05 individuals homozygous for the reference allele, a total of 438 additional SNVs were significantly associated 06 with IGHV3-66 usage (Fig. 3c). At the most significant SNV from this analysis, only reference allele 07 homozygotes and heterozygotes were observed. In heterozygotes, the mean usage was 0.006% compared to 08 0.0003% in homozygotes, with many individuals in the homozygote group exhibiting 0% usage (Fig. 3c). Thus, 09 based on this conditional guQTL analysis, variation in IGHV3-66 usage can be further explained even in 10 individuals with relatively low usage. 11 Gene by guQTL network analysis reveals that the usage of multiple genes is 12 associated with overlapping sets of variants 13 In addition to discovering multiple variants associated with the usage of a single gene, our guQTL association 14 analyses also identified single variants associated with the usage of multiple genes. This was intriguing as 15 V(D)J recombination studies in animal models have demonstrated the coordinated selection of genes through 16 the same regulatory elements 32,51 . In mice, IG V genes reside in topologically associating domains (TADs) and 17 disruption of regulatory elements within the IG loci has been shown to cause altered gene usage within these 18 domains [52][53][54] . Given this, we further assessed coordinated genetic signals involving sets of multiple variants 19 and genes. We found that 2,607 (66%) guQTL variants were associated with > 1 gene (Fig. 3d). We reasoned 20 that this could have multiple underlying causes: (1) the SNV is tagging a SV overlapping multiple genes; (2) the 21 SNV is tagging multiple causative regulatory SNVs; (3) the SNV is overlapping a regulatory element controlling 22 multiple genes; or (4) a combination of any of the prior explanations. 23 To determine the set of guQTL genes with the same set of guQTL variants, we created a network with 24 genes as nodes and edges connecting genes associated with the same guQTL SNVs ( (i.e., more than 2 SNVs connecting 2 genes). These 23 cliques included a total of 16 IGHD and 29 IGHV 28 genes, with the number of genes per clique ranging from 2 to 9. Out of the 23 cliques, 10 were primarily 29 composed of genes within SVs ( Fig. 3e; Supplementary Figure 13). 30 We also identified cliques made up primarily of genes outside of SVs (Fig. 3f). For example, the SNV 31 shown in Figs. 3g was associated with the usage of 7 genes, IGHV4-31, IGHV3-53, IGHV4-59, 32 IGHV3-64, IGHV3-66 and IGHV1-69/-69D; this variant was located ~120 Kbp away from the nearest SV, and 33 exhibited low LD with the SV (r 2 = 0.09) . Interestingly, gene usage patterns associated with this SNV were 34 either negatively or positively correlated depending on the gene. Individuals homozygous for the reference 35 allele had higher usage of IGHV4-31, IGHV3-53, IGHV4-61 and IGHV1-69/-69D and lower usage for the 36 remaining genes. In summary, we show that the usage of specific sets of genes in the repertoire are 37 associated with the same sets of variants, indicating the potential for complex and coordinated regulatory 38 mechanisms. 39 Variants associated with gene usage variation are enriched in regulatory 40 regions involved in V(D)J recombination 41 Large scale studies using expression, epigenomic and disease or trait-associated variant datasets have 42 identified non-coding variants in regulatory elements linked to their phenotypes of interest 48,[55][56][57] . Specific to 43 V(D)J recombination, recombination signal sequences (RSS) are sequence motifs in IG and T cell receptor 44 non-coding regions used by RAG1/RAG2 proteins to direct double-strand DNA breaks and initiate somatic 45 recombination 58 . Additionally, CTCF and cohesin binding has been shown to regulate contraction and 46 recombination of IGH 59-61 . We therefore hypothesized that variants might modulate gene usage through 47 additional genes were also associated with the same guQTL SNV. For KD, the SNVs detected in the GWAS 03 were also associated with IGHV1-69/-69D, IGHV3-64 and IGHV4-61 usage (Fig. 6b). Similar to using 04 expression data to prioritize genes affected by SNVs identified from GWAS, here we show that guQTL-GWAS 05 SNVs are associated with the usage of multiple genes in the Ab repertoire. Additional diseases/traits 06 associated with SNVs identified by both GWAS and our guQTL analysis included the proportion of 07 morphologically activated microglia in the midfrontal cortex, and estradiol levels, which were associated with 08 the usage of IGHV1-69/-69D and IGHV2-70D, and IGHV1-8, IGHV3-64D, IGHV3-9 and IGHV5-10-1 usage, 09 respectively (Fig. 6c). In both of these two examples, the GWAS SNVs and guQTLs were in strong LD with 10 SVs spanning these respective sets of candidate genes (r = .51 and r=.98) suggesting that the observed 11 effects could at least in part be SV-mediated. 12 Repertoire-wide gene usage profiles are more highly correlated in 13 individuals carrying shared IGH genotypes 14 Previous studies in monozygotic twins have shown that gene usage frequencies in genetically identical 15 individuals are more highly correlated than in unrelated individuals 20,21 . We reasoned that such effects could 16 also be observed at the population level by assessing correlations in individuals sharing greater versus fewer 17 IGH guQTL SNVs. To assess this, we used allele sharing distance 69,70 (ASD) to group individuals with similar 18 genotypes across IGH and compare the IgM gene usage correlation between groups. Two ASD-based 19 groupings were performed using either (1) the lead guQTL per gene (Fig. 7a), or (2) all guQTLs (Fig. 7b). We 20 tested the latter case as we noted above that multiple variants could influence a single gene, and it has been 21 shown that accounting for a greater number of common variants associated with a given phenotype can 22 explain more variation in that phenotype 71 . Repertoire-wide gene usage correlations between samples were 23 calculated using the Pearson's Correlation coefficient. Using only the lead guQTL variants for each gene, 24 individuals with the most overlapping guQTL genotypes (low ASD) had a higher mean IgM gene usage 25 correlation than those in the group with the highest ASD scores (0.958 vs. 0.943; KS test P value < 3.8e-15). 26 The same pattern was observed when using all significant variants (0.956 vs. 0.943; KS test P value = 0.008). 27 These results indicated that genetic background makes a significant contribution to the overall gene usage 28 composition of the repertoire, and expand on previous observations made in twin studies 20,21 , by demonstrating 29 that heritable components of the heavy chain repertoire can be directly linked to germline variants in the IGH 30 locus. 31 Discussion 32 In this study, we show conclusively that IGH genetic polymorphisms influence the composition of the Ab 33 repertoire through impacts on gene usage frequencies. Resolution of complex IGH genetic variants using long-34 read sequencing identified associations between these variants and gene usage within the IgM and antigen-35 stimulated (IgG) repertoire. Variants were found to affect the Ab repertoire via (1) SVs that alter IGH gene copy 36 number, including deletions that completely remove genes from the repertoire, as well as through (2) SNVs 37 and indels, including those overlapping regulatory elements and transcription factor binding sites linked to 38 V(D)J recombination. The strength of these associations was substantial, in some cases explaining >70% of 39 variance in usage of particular genes. Building on past observations from twin studies 20,21 , we found that 40 repertoire-wide gene usage patterns were more similar in individuals sharing a greater number of genotypes 41 across IGH. Together, these findings (1) advance our basic understanding of repertoire development, 42 illuminating regions of IGH involved in gene regulation, and (2) more broadly represent a paradigm shift 43 towards a model in which the Ab repertoire is formed by both deterministic and stochastic processes. This shift 44 has critical implications for delineating the function of Abs in disease, with great potential to inform the design 45 and administration of therapeutics and vaccines. 46 Resolving IG germline variants has historically been impeded by technical challenges resulting from the 47 complex and highly polymorphic nature of the IG loci. Specifically, high-throughput approaches, including 48 microarray and short-read sequencing are not able to fully and accurately resolve IG germline variation 28, 72 . 49 Long-read sequencing has proven invaluable for resolving complex genomic regions, resulting in drastic improvements in variant detection 29,73 . However, whole genome sequencing of large cohorts with long-read 51 sequencing remains costly, laborious and prohibitive in many cases. Our alternative approach, using a 52 targeted long-read protocol to selectively sequence the IGH locus in a cost-effective, multiplexed fashion, 53 allowed us to characterize a broad spectrum of genetic variants in IGH for 154 donors, providing the largest 54 long-read resolved collection of SVs, SNVs, and indels for this locus to date. 55 SVs are a hallmark of the IGH locus [25][26][27]46,74 , which was clearly supported by our analysis. We 56 breakpoint resolved 28 SV haplotypes/alleles within 8 different SV loci spanning 542 Kbp of IGH; this included 57 13 novel SV alleles, and collectively resulted in copy number changes in 6 IGHD genes and 31 IGHV genes, 58 representing 22% and 53% of all IGHD and IGHV genes in IGH, respectively. Critically, our ability to resolve 59 SVs allowed us to more comprehensively detect and genotype SNVs and indels. In total, we identified 20,510 60 unique SNVs and 966 indels, 7980 and 223 of which were common. A significant fraction of these overlapped 61 SVs (n=3,406), which we accurately genotyped as hemizygous. The increased performance of our approach 62 was demonstrated through a comparison of our callsets to dbSNP, which revealed that the majority of common 63 SNVs (63%; n=5057) detected were labeled as rare in frequency, lacked allele frequency data, or were 64 completely missing from dbSNP altogether. Additional novelty was discovered through the annotation of IGH 65 genes, revealing 135 undocumated alleles not currently curated in the germline gene database IMGT 75 . 66 Together, these data hinted at the extent of variation that we have yet to describe in this complex locus, and 67 bolster previous concerns that past genetic studies have overlooked IGH variants 28,31,76 . A major outcome of 68 this study is that these data can start to be used to augment existing resources and databases that aim to 69 provide improved reference data for the IG loci 30,77 . 70 In addition to increasing our knowledge of IGH diversity, our ability to more fully resolve polymorphisms 71 facilitated the identification of IGH germline variants that impact Ab repertoire diversity at a level that was 72 previously not possible. Identifying associations between genetic variants and gene expression is a key step in 73 determining the functional roles of germline variation in disease and clinical phenotypes, as well as resolving 74 the molecular mechanisms that underlie gene regulation 48,78 . By combining genetic variants with gene usage 75 information across IGHV, D and J genes derived from AIRR-seq data, we performed the first gene usage QTL analysis, assessing associations between 7,297 common variants and 81 genes to identify polymorphisms 77 explaining gene usage in the expressed IgM and IgG repertoire. These analyses revealed that half (52%) of 78 common variants were associated with gene usage variation, impacting 59 (73%) genes in the IgM repertoire, 79 with similar results in the IgG repertoire, indicating that patterns in IgG are likely highly influenced by the gene 80 usage composition initially established in IgM, as noted previously 20,21 . Furthermore, for 10 of the 59 genes 81 identified in our analysis (9 of which were within SVs), the most significant variant explained more than half of 82 the gene usage variation (R 2 > .5). A conditional analysis further found that for 14 out of the 59 guQTL-83 associated genes in IgM, additional variance in gene usage could be explained by secondary polymorphisms, 84 indicating that for at least a subset of IGH genes, interactions and additive effects across multiple variants will 85 ultimately need to be resolved. These collective effects of polymorphisms across the repertoire as a whole 86 were clear when we compared repertoires between individuals based on genetic similarity. As expected 20,21 , 87 we found that usage patterns were more highly correlated in individuals sharing IGH genotypes. This indicated 88 that overlapping signatures in the repertoires of different individuals may be possible to identify and 89 characterize with greater resolution at the population level by simply taking into account IGH genetic data 22 . 90 The guQTLs discovered provide the first insights into the potential functional mechanisms underlying 91 the development of the Ab repertoire in humans. First, the association between SVs and gene usage variation 92 offer a straightforward model for how germline variants impact the repertoire. Specifically, our results indicated 93 that SVs change the copy number of genes, directly modifying their usage frequency in an additive fashion, 94 likely by influencing the probability that the SV-associated genes are selected by V(D)J recombination based 95 on the number of chromosomes on which they are present. This pattern was observed for the majority of 96 genes associated with SVs in our dataset, and has been noted previously 40,42 . Interestingly, there were also 97 genes for which usage was impacted by neighboring SVs, even though the copy number of these genes was 98 not directly altered, suggesting more complex mechanisms 42 . Beyond the effects of SVs, we found a significant 99 number of SNVs associated with gene usage, all of which were in intergenic regions; again, this highlights the 00 importance of our approach for capturing all IGH variant types, beyond just coding polymorphisms. Network 01 analysis connecting genes with overlapping guQTL variants identified sets of genes whose usage patterns were coordinated; in many cases these genes were co-localized to specific regions of IGH, spanning 10's to 03 100's of Kbp. As with patterns observed for SVs, these signatures were illustrative of more complex regulatory 04 mechanisms in the IGH locus. These regional effects appear consistent with studies of V(D)J recombination in 05 model organisms. For example, the mouse IG loci partition into distinct regions, marked by specific regulatory 06 marks, including TFBS and histone modification signatures, many of which, alongside RSS variation, have 07 been associated with intra-gene V(D)J recombination frequency differences 32,79,80 . The mouse IG loci are also 08 characterized by 3-dimensional structure, TADs and sub-TADs, associated with complex interactions between 09 gene promoters and enhancers that coordinate V(D)J recombination in pre-B cells 35,52,[81][82][83] . In contrast to 10 mouse, functional genomic elements dictating V(D)J recombation in the human IGH locus have not been 11 characterized in depth; nonetheless, our intersection of guQTLs with publicly available annotation sets 12 revealed enrichments in cis-regulatory elements and TFBS involved in V(D)J recombination in animal models. 13 This included CTCF and EED TFBS, as well as IGH regions marked by H3K4me3 53,60-62 . While fine mapping 14 and functional validation of guQTLs is needed, this result was reaffirming given that gene usage in the IgM 15 repertoire is a proximal measurement of V(D)J recombination, providing initial evidence that the variants we 16 identified likely influence the frequency at which IGH genes are selected during V(D)J recombination. 17 Ultimately, an improved understanding of Ab repertoire diversity and function will be critical to resolving 18 the role of B cells in disease. This study provides support for the idea that leveraging IG genetic data can 19 better delineate Ab response dynamics in a variety of contexts. For one, there is growing interest in developing 20 predictive models for V(D)J recombination and repertoire diversity 84,85 , and applying Ab repertoire profiling as a 21 diagnostic tool for disease and clinical phenotypes of high public health relevance 86,87 . However, current 22 models do not explicitly account for genetic factors, and the effects of this on model performance are not 23 known 84,85 . Our results indicate that future work in this area should explore ways to integrate genetic data; this 24 will likely be critical for better understanding commonalities and differences in repertoire signatures, not only for 25 gene usage patterns, but also in identifying additional features (e.g., public clonotypes 1,2 ), overall leading to 26 improved metrics for immune response monitoring and prediction modeling.
Here, we demonstrate that our data already provide an opportunity to more fully explore the potential 28 roles of IGH polymorphism in Ab-mediated diseases. First, the direct overlap of GWAS SNVs and guQTLs 29 indicate the potential for effects of GWAS variants to be mediated through genetic effects on Ab gene usage. 30 This parallels approaches employed for eQTLs and GWAS variants elsewhere in the genome to nominate 31 genes/pathways underlying human phenotypes 48,88-90 . As additional disease genetic associations are made in 32 IGH, our dataset will continue to be useful for making such first-line connections, and drive the generation of 33 novel hypotheses that can be explored experimentally. Second, our results can directly inform our 34 understanding of vaccine responsiveness, particularly as this pertains to efforts centered around the elicitation 35 of targeted antibodies. Notably, our analysis revealed that IGHV coding variation was in many cases linked to 36 guQTLs, indicating that usage patterns can coincide with amino acid differences that are important for Ab-37 antigen interactions. This is consistent with previous reports 23,41,91 , including examples related to precursor 38 germline alleles critical for broadly neutralizing Abs in various infectious diseases. For example, it has been 39 shown that IGHV1-2 germline alleles associated with HIV VRC01 Abs, which are a current focus of germline 40 targeting immunogens, have variable usage frequencies in the IgM repertoire and associate with variable 41 immunogen-specific B cell frequencies 23 . Another clear example is a germline variant that encodes a critical 42 phenylalanine within the CDR2 of IGHV1-69-derived broadly neutralizing Abs against the influenza 43 hemagglutinin stem 41,64 . This variant has not only been shown to facilitate antigen binding, but also (mirroring 44 patterns observed for VRC01 alleles) is associated with variable usage patterns in the IgM and IgG 45 repertoire 41 . Interestingly, in both of these examples, allelic variants vary considerably between human 46 populations 23,41 , indicating that both population-level diversity and the role of germline variants in shaping the 47 baseline B cell repertoire will need to be considered in interpreting vaccine response data 22 . 48 While the dataset we have analyzed here represents the most comprehensive survey to date, it is likely 49 that increasing the sample size will uncover additional genetic contributions to gene usage. For example, by 50 lowering our P value threshold by only a factor of 10, the fraction of IGH genes with usage associated to at 51 least one genetic variant increased from 73% to 91% (74/81). This further bolsters our finding that a large 52 fraction of variation in repertoire gene usage between individuals will likely be explained by variants in IGH. 53 Rarer and complex IGH variants will need to be better accounted for in future work, specifically those excluded 54 from our analysis due to low frequency and genotyping coverage. For example, SV alleles within the highly 55 complex and polymorphic IGHV3-30 region will require sequencing and haplotyping in larger cohorts to better 56 resolve the effects of variation for those genes, which have suspected roles in disease 92, 93 . In addition, it will be 57 important for future work to also consider integrating analyses of the IG light chain loci. Light chain genes 58 contribute to Ab folding and Ab-antigen interactions [94][95][96] , and it is plausible that both trans-effects and 59 interactions between heavy and light chain variants could influence gene usage. The development of models 60 that incorporate both genetic variation and features specific to both chains (e.g., binding and stabilization), 61 would more fully delineate the total genetic contribution to variation in the Ab repertoire. In addition, as cohorts 62 increase in size, additional insight will come from the consideration of other variables such as genetic ancestry, 63 positive/negative selection, age, B cell subset and tissue 97-99 . Finally, the models utilized here could be 64 extended to assess the contribution of IGH polymorphisms to other repertoire signatures, including N/P 65 addition and CDR3 features, which also are influenced by heritable factors 20,21,38,85 . 66 Collectively, our analyses provide the first comprehensive picture of IGH polymorphism and Ab 67 repertoire variation. These findings have the potential to reshape the way we conduct, analyze and interpret 68 AIRR-seq data, and use these data to profile the Ab response in disease. As noted previously, the results 69 provided here further illuminate the need for improving efforts to more fully explore the extent of IGH 70 polymorphism in the human population, as a means to resolve the role of germline variation in Ab function and 71 disease. 72

73
Long-read library preparation and sequencing 74 Genomic DNA was extracted from peripheral blood mononuclear cells (PBMC) or polymorphonuclear 75 neutrophil (PMN) procured from Stanford University, Harvard University or STEMCELL Technologies 76 (Vancouver, Canada). Genomic DNA was processed using our published targeted long-read sequencing 77 protocol 28 . High molecular weight DNA (0.5-2 ug) was sheared using g-tubes (Covaris) and size selected using 78 the 0.75% DF 3-10 Kbp Marker S1-Improved Recovery cassette definition on the Blue Pippin (Sage Science); 79 library size ranges provided in Supplementary Fig. 1. The DNA was End Repaired and A-tailed using the 80 standard KAPA library protocol (Roche). Barcodes were added to samples sequenced in multiplex pools and 81 universal primers were ligated to all samples. PCR amplification was performed for 8-9 cycles using high-82 fidelity polymerase (LA-Taq or PrimeSTAR GXL, Takara) at an annealing temperature of 60℃. Small 83 fragments and excess reagents were removed using 0.7X AMPure PB beads (Pacific Biosciences). Libraries 84 were hybridized to IGH-specific oligonucleotide probes (Roche; see reference 28 ) and recovered using 85 streptavidin beads (Life Technologies) prior to another round of PCR amplification for 16-18 cycles using either 86 LA-Taq or PrimeSTAR GXL (Takara) at an annealing temperature of 60℃. 87 for Sequel/IIe libraries, multiple samples are barcoded and sequenced in a single sequencing run. Critically, 00 the high HiFi read quality overcomes historical concerns of high error rates in long-read sequencing data 01 (Table 1), and error-correction steps performed during the assembly process increases the read base-level accuracy 100,101 . Previously, we have shown that assemblies produced from the older RSII platform have high 03 base-level accuracy 28 . 04 For a single sample, we prepared libraries for adaptive nanopore sequencing using the Ligation 05 Sequencing Kit (Oxford Nanopore Technologies, ONT) and the NEBNext Companion Module for ONT Ligation 06 Sequencing (New England Biolabs). 3 μg gDNA was used as input for these libraries. Entire purified libraries 07 (5-50 fmol, per manufacturer's recommendation) were loaded onto R9.4.1 flow cells on the MinION Mk1C 08 instrument (ONT). The experimental run was set up with no multiplexing, turning on enrich.fast5, and using 09 human nanopore enrichment. Additionally, fast (or high accuracy) base calling was employed for a 72-hour 10 run. In addition to IGH, multiple genomic loci were targeted for sequencing in order to provide the minimum 11 number of bases (17 Mb) required for adaptive sequencing. The IGH sequence targeted was from the custom 12 reference used in this study (below). Custom linear IGH reference 25 A custom linear reference for IGH was used that includes previously resolved insertion sequences 25

absent in 26
GRCh38. This reference was previously used and vetted to generate high confidence variant call sets 28 . The 27 reference was built off of GRCh38 (chr14:105860500-107043718). Partial sequences from GRCh38 were 28 removed and additional insertion sequences were added from previously characterized structural variants 25 . 29 Specifically, sequence between chr14:106254581-106276923 (GRCh38) was swapped for a 10. genotyped using HiFi read coverage and soft-clipped sequences in the assembly and in HiFi reads, and 45 manually resolved using BLAST and custom python scripts. All SV genotypes were visually inspected using 46 Integrated Genome Viewer (IGV) screenshots generated from an IGV batch script.
Characterizing novel alleles and expanding the IGH allele database 48 Novel alleles for IGHV, IGHD and IGHJ genes supported by 10 HiFi reads (exact matches) or found in 2 or 49 more individuals were extracted from the assemblies of each sample. Novel alleles were defined as those not 50 found in the IMGT database (release 202130-2). Allele sequences that aligned to IMGT alleles with 100% 51 identity were also characterized as novel, if the putative novel allele was annotated from a gene in the 52 assembly that was different from the gene assignment in the IMGT database. The non-redundant set of novel 53 alleles was appended to the IMGT database for IgM/IgG repertoire sequencing analyses conducted in this 54 study. A BLAST database was created using makeblastdb version 2.11.0+. Gapped sequences for the novel 55 alleles were generated using the IMGT/V-QUEST server 105 . 56 57 Processing AIRR-sequencing data 58 Paired-end sequences ("R1" and "R2") were processed using the pRESTO toolkit 106 . All R1 and R2 reads were 59 trimmed to Q=20, and reads <125 bp were excluded using the functions "FilterSeq.py trimqual" and "FilterSeq 60 length", respectively. Constant region (IgM and IgG) primers were identified with an error rate of 0.2 and 61 corresponding isotypes were recorded in the fastq headers using "MaskPrimers align". 62 For sequencing datasets without unique molecular identifiers (UMIs), R1 and R2 reads were 63 assembled using "AssemblePairs align", and resulting merged sequences <400 bp were removed using 64 "FilterSeq length". Identical sequences were collapsed, and read duplicate counts ("Dupcounts") were 65 recorded. For sequencing datasets with UMIs, the 12 base UMI, located directly after the constant region 66 primer, was extracted using "MaskPrimers extract". Sequences assigned to identical UMIs were grouped and 67 aligned using "ClusterSets" and "AlignSets muscle", and then consensus sequences were generated for each 68 unique UMI set using "BuildConsensus". Identical sequences with different UMIs were collapsed, and read 69 duplicate counts ("Dupcounts") were recorded. Collapsed consensus sequences represented by <2 reads were 70 discarded.