Introduction

The diversity of sequences in the variable regions of immunoglobulins is the basis for the ability of these molecules to bind a virtually unlimited number of antigenic structures. Sequence diversity in the primary repertoire is created by recombination of germline segments for both the heavy and light chains, which results in the formation of full-length immunoglobulin variable region exons1. The light chain variable region is created by the joining of VL and JL germline segments while the VH region is created by recombination of VH, DH, and JH germline segments. The process of recombination starts with the heavy chain in progenitor B cells, initiated by D/JH recombination followed by VH/DJH recombination2,3. Light chain recombination occurs in pre-B cells after successful completion of the heavy chain recombination. Germline segments in both chains are also trimmed and extended by a variable number of nucleotides by exonucleolytic nibbling of germline segments and random nucleotide incorporation in the N-regions flanking the DH germline segment mediated by terminal deoxynucleotidyl transferase and germline segment palindromic duplications3. Self-reactive B cell clones with full-length, in-frame variable regions have VL sequences replaced by receptor editing or are removed from the repertoire by apoptosis4,5. Cells passing this self-reactivity checkpoint form the naive B cell repertoire5.

The third complementarity-determining region (CDR) of the heavy chain (CDR H3) is the region of highest overall sequence and length diversity in antibody repertoires1. CDR H3 length approximates a Gaussian distribution6 and the average CDR H3 length varies as a function of species, age, isotype, B cell development stage and disease state6,7,8,9,10,11,12,13. CDR H3 amino acid composition is also biased in a CDR H3 length-dependent manner, associated with differential incorporation of DH and JH germline segment sequences of different lengths and sequence composition into CDR H3 of different lengths6. Beyond the germline segment biases, CDR H3 sequence biases that reflect underlying selective biases in B cell maturation have also been described. In particular, a bias towards shorter average CDR H3 lengths is observed in mature relative to immature B cells and in isotype-switched memory B cells relative to naive B cells9,10,14. This is accompanied by a reduction of positively charged residue content and hydrophobicity within CDR H3 associated with negative selection of self-reactive clones in the repertoire9,11,15,16.

The analyses of CDR H3 diversity and biases in health and disease have been mostly performed independently of the V regions contributed by VH and VL germline segments6,7,8,9,10,11,17,18,19,20. Except for sequences that are directly incorporated into CDR H3, the impact of V germline segments on CDR H3 properties has neither been addressed nor expected. Analysis of the impact of the VL on CDR H3 has been limited to properties of the third CDR of the light chain, which is closely associated with CDR H3, without any evidence of biases14. Finally, analysis of the impact of JH germline segments on CDR H3 biases has been confined to the expected effects of differential JH germline segment length and sequence composition6. A recent analysis of a large dataset of isotype-switched human antibody sequences with paired chain information revealed an unexpected preferential pairing of IGHV3-7 (VH3-7) and Vκ2-30 germline segments21. This was determined upon further analysis to be linked with short CDR H3 length biases associated with both germline segments. This unexpected finding prompted us to undertake a high-dimensional analysis of CDR H3 sequences from several human antibody datasets to investigate the extent to which CDR H3 length might be biased by germline segment use in human immunoglobulin repertoires. Our results show several biases in CDR H3 and junctional length distributions associated with VH, VL, and JH germline segment utilization that shape naive and antigen-experienced antibody repertoires in unexpected and unpredictable patterns.

Results

Sequence datasets

We analyzed sequences from four previously described B cell repertoire deep sequencing datasets including three donors each and a fifth dataset with eight donors (used here as an independent test dataset in most analyses to avoid over-representation of donors from a single source in combined data), referred to here as the TX, WA, CA, MA, and SRI datasets20,21,22,23,24,25. These represent the largest publicly available datasets with VH/VL pairing (TX and CA) or VH-only information (MA, WA, SRI). The datasets were previously sequenced and bioinformatically parsed using a diversity of methods, minimizing the impact of methodological biases. The number of donors, CD27 marker and isotype information for each dataset are summarized in Table 1 (see Supplementary Table 1 for details). Data were normalized by analyzing each donor individually and averaging within or among datasets to avoid over-representation of sequences from larger datasets. No antigen-specific selection of B cells was performed for any of the datasets, although the CA and MA datasets include both pre- and post-vaccination samples21,22. For simplicity we refer to the TX CD27pos IgG/IgA, CA IgG, MA IgG/IgA, WA CD27pos, and SRI IgG subsets as TX AE, CA AE, MA AE, WA AE, and SRI AE, respectively (with “antigen-experienced”, or “AE”, encompassing all five), the TX CD27pos IgM sequences as “AE IgM” and the TX and WA CD27neg subsets as TX naive and WA naive respectively (with “naive” encompassing both, Supplementary Table 1). SRI IgM sequences with no amino acid somatic mutations between Cys-23 and Cys-104 (International ImMunoGeneTics, IMGT®, numbering system26 used throughout) including most of the region covered by reads are referred to as SRI naive.
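The per-donor normalization described above can be illustrated with a minimal Python sketch. It is not the pipeline used for the analyses; the function names and the toy donors are hypothetical, and the only point is that averaging per-donor frequencies gives each donor equal weight regardless of how many sequences it contributes.

```python
from collections import Counter

def length_frequencies(lengths):
    """Return {length: fraction} for one donor's list of CDR H3 lengths."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}

def donor_averaged_distribution(lengths_by_donor):
    """Average per-donor length frequencies so that each donor has equal weight."""
    per_donor = [length_frequencies(v) for v in lengths_by_donor.values()]
    all_lengths = sorted({length for d in per_donor for length in d})
    n_donors = len(per_donor)
    return {
        length: sum(d.get(length, 0.0) for d in per_donor) / n_donors
        for length in all_lengths
    }

# Hypothetical toy input: donor_B has 25 times more sequences than donor_A.
toy = {"donor_A": [10, 10, 12, 12], "donor_B": [12] * 100}
averaged = donor_averaged_distribution(toy)
```

In this toy case the averaged frequency of length 10 is 0.25, whereas pooling all sequences before counting would give 2/104, which is the over-representation of the larger donor that the averaging step is meant to avoid.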

Table 1 Summary of datasets used for analysis.

We aimed at identifying properties shared among donors not influenced by clonal expansion. To minimize the impact of clonal expansion, all datasets except nonproductive sequences were processed to retain a single sequence from each lineage, or clonotype, by clonotype clustering, according to germline segments as well as CDR H3 length and sequence similarity (see ‘Clonotype clustering’ under Methods and Supplementary Data 1, 2, and 3 for unique sequence counts after clustering). Overall distribution of CDR H3 lengths was not noticeably affected by removal of redundant sequences in most datasets except for the WA AE and MA AE compartments, which had subtle shifts (Supplementary Fig. 1a-f). The overall AE CDR H3 length distributions are similar among datasets except SRI, allowing pooling data of AE B cell subsets from different datasets (Supplementary Fig. 1g). However, the relative CDR H3 length distributions of the WA and TX naive B cell subsets differed by an average of 0.9 residues (Supplementary Fig. 1h) and were analyzed separately. Germline segment-specific analyses were performed with germline segments with at least 80 unique clonotypes in a donor, which, in aggregate, include 94–99% of the unique clonotype sequences in the CA, TX, MA and SRI datasets (Supplementary Data 1, 2, and 3). Germline segment-specific analyses in the WA dataset were performed with 16 VH germline segments that had fewer than 10% ambiguous calls in the naive subset and germline segments prevalent in other datasets, totaling about one third of the entire dataset (Supplementary Data 4).
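As an illustration of clonotype clustering by germline segments, CDR H3 length and sequence similarity, the following Python sketch groups records and keeps one representative per cluster. It is a simplified greedy scheme under stated assumptions, not the procedure from Methods: the record layout, the `min_identity` value of 0.8 and the gene names in the toy input are all hypothetical.

```python
from collections import defaultdict

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_clonotypes(records, min_identity=0.8):
    """Greedy clonotype clustering (assumed scheme, for illustration only).

    records: iterable of (v_gene, j_gene, cdrh3) tuples with non-empty cdrh3.
    Sequences are first grouped by V gene, J gene and CDR H3 length; within a
    group, a sequence joins an existing cluster if it reaches min_identity to
    that cluster's representative, otherwise it founds a new cluster.
    Returns one representative record per cluster.
    """
    groups = defaultdict(list)
    for v_gene, j_gene, cdrh3 in records:
        groups[(v_gene, j_gene, len(cdrh3))].append(cdrh3)
    representatives = []
    for (v_gene, j_gene, _), seqs in groups.items():
        reps = []
        for seq in seqs:
            if not any(identity(seq, rep) >= min_identity for rep in reps):
                reps.append(seq)
        representatives.extend((v_gene, j_gene, rep) for rep in reps)
    return representatives

# Hypothetical toy records.
toy_records = [
    ("IGHV3-7", "IGHJ4", "ARDYFDY"),
    ("IGHV3-7", "IGHJ4", "ARDWFDY"),  # 6/7 identical to the first, same cluster
    ("IGHV3-7", "IGHJ4", "ARGGGGY"),  # 3/7 identical, new cluster
    ("IGHV1-2", "IGHJ4", "ARDYFDY"),  # different V gene, new cluster
]
unique_clonotypes = cluster_clonotypes(toy_records)
```

A greedy pass depends on input order, so a production analysis would more likely use a linkage-based clustering; the sketch only shows which attributes define a clonotype here.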

Average CDR H3 length varies with VH and VL germline segment use

We analyzed average CDR H3 length by VH or VL germline segment use. Average CDR H3 length in the AE subset varied by up to 3 amino acid residues as a function of VH germline segment use and correlated well for different datasets when compared to the TX dataset (Spearman’s r = 0.72–0.84) (Fig. 1a). Average CDR H3 length also varied as a function of VL germline segment use by up to four amino acid residues in the AE compartment and correlated well between the CA and TX datasets (Spearman’s r = 0.93) (Fig. 1e). The naive compartment showed a more limited spread in average CDR H3 lengths relative to the AE compartment (Fig. 1b, c, d and f, blue squares). More pronounced reductions in average CDR H3 length in the AE relative to naive compartments were associated with a subset of VH and VL germline segments (Fig. 1b, c, d and f). The TX AE IgM subset showed similar trends as the TX AE subset except that average CDR H3 length was decreased in association with most VH germline segments relative to the naive compartment (Fig. 1b, pink dots).
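The between-dataset comparison above relies on Spearman's rank correlation of per-germline-segment average CDR H3 lengths. A self-contained Python sketch of that calculation follows; the average lengths in the toy dictionaries are invented for illustration and are not values from the datasets.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's r computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical average CDR H3 lengths (amino acids) per VH germline segment.
dataset_1 = {"VH1-69": 16.1, "VH3-7": 13.9, "VH3-23": 15.0, "VH4-34": 15.6}
dataset_2 = {"VH1-69": 16.4, "VH3-7": 14.2, "VH3-23": 15.3, "VH4-34": 15.1}
shared = sorted(set(dataset_1) & set(dataset_2))
rho = spearman([dataset_1[s] for s in shared], [dataset_2[s] for s in shared])
```

Only germline segments present in both datasets are compared, which mirrors the need for a minimum clonotype count per segment before a segment-specific average is meaningful.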

Fig. 1: Average CDR H3 length associated with VH and VL germline segments.

Average CDR H3 lengths associated with different germline segments in the TX AE dataset (abscissa) correlated with the CDR H3 associated with VH germline segments of different datasets (a) and CA VL germline segments (e) in the AE compartments. Each dot represents the average CDR H3 length for a different VH or VL germline segment within each dataset, averaged across donors within each dataset. Diagonal lines show 1:1 length correspondence between datasets, with Spearman’s correlation coefficients shown. b–d Average CDR H3 length associated with VH germline segments in the TX AE, TX AE IgM, WA AE, SRI AE and corresponding naive B cell datasets. Solid horizontal bars indicate average CDR H3 length in the naive B cell compartment. The order of germline segments is the same in the panels. f Average CDR H3 length associated with VL germline segments in the TX AE and naive B cell datasets. Solid horizontal bars indicate average CDR H3 length in the naive B cell compartments. Error bars indicate S.E.M. for n = 3 donors or range for n = 2 donors. Germline segments without error bars have only 1 donor with at least 80 sequences (TX dataset, see Supplementary Data 1, 2 and 3) or sequences mostly represented by 1 donor (WA, see Supplementary Data 1 and 2). Lengths are shown in amino acid residues according to the IMGT® CDR definition26.

CDR H3 length distribution varies with VH germline segment use

We next determined whether CDR H3 length distribution varies with germline segment use. Overall and germline segment-specific CDR H3 length distributions were determined individually for each donor and the frequency of each CDR H3 length averaged within or across the TX, CA, MA, and WA datasets (11 donors), with the eight donors from the SRI dataset analyzed separately. Overall CDR H3 length distribution of the respective B cell compartment, which is influenced by germline segment frequency in the datasets, was used as a relative standard to which germline segment-specific CDR H3 length distributions were compared. This was done to facilitate comparison of biases between germline segments and also because useful objective reference distributions are not available to determine bias types in naive compartment sequences. Therefore, most biases described here, including all in the naive compartment, are relative to the entire set of clonotypes in each B cell compartment. Statistical analysis of biases was performed in the AE compartment by a two-tailed paired (by donor) t-test of length frequencies with a sliding window of two consecutive CDR H3 lengths to minimize the impact of local fluctuations. Observed length distribution biases included overall shifts in average CDR H3 length for sequences with different VH germline segments and also obvious and subtle deviations from the overall CDR H3 distribution in discrete ranges of the length spectrum (Fig. 2, top row, Supplementary Fig. 2a).
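The sliding-window paired comparison can be sketched in Python as follows. How the two-length window enters the test is not specified in this section, so the sketch assumes that the frequencies of the two consecutive lengths are summed per donor before pairing; that choice, the function names and the toy frequencies are assumptions. Only the t statistic is computed (with n - 1 degrees of freedom for n donors); converting it to a two-tailed P value requires the Student t distribution, which is left to a statistics library.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic for matched samples; None if the differences are constant."""
    diffs = [x - y for x, y in zip(xs, ys)]
    sd = stdev(diffs)
    if sd == 0:
        return None
    return mean(diffs) / (sd / sqrt(len(diffs)))

def sliding_window_t(segment_freqs, overall_freqs, lengths, window=2):
    """Paired t statistic per window of consecutive CDR H3 lengths.

    segment_freqs, overall_freqs: one {length: frequency} dict per donor, in
    the same donor order. Frequencies inside a window are summed per donor
    (an assumption of this sketch) and then compared pairwise across donors.
    """
    results = {}
    for start in range(len(lengths) - window + 1):
        win = lengths[start:start + window]
        seg = [sum(d.get(length, 0.0) for length in win) for d in segment_freqs]
        ovr = [sum(d.get(length, 0.0) for length in win) for d in overall_freqs]
        results[tuple(win)] = paired_t(seg, ovr)
    return results

# Hypothetical frequencies for three donors over three CDR H3 lengths.
segment = [
    {10: 0.30, 11: 0.25, 12: 0.45},
    {10: 0.28, 11: 0.30, 12: 0.42},
    {10: 0.35, 11: 0.25, 12: 0.40},
]
overall = [
    {10: 0.10, 11: 0.20, 12: 0.70},
    {10: 0.12, 11: 0.22, 12: 0.66},
    {10: 0.11, 11: 0.18, 12: 0.71},
]
t_by_window = sliding_window_t(segment, overall, [10, 11, 12])
```

A positive statistic for a window indicates relative enrichment of those lengths in the germline segment-specific distribution, and a negative one indicates relative depletion.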

Fig. 2: CDR H3 length distribution groups associated with VH germline segments.

Characteristic examples of each VH bias group are shown, averaged for TX, CA, MA and WA donors in the AE compartment (top row, n = 11 donors), TX naive (second row, n = 3 donors), WA naive (third row, n = 3 donors) and WA nonproductive compartments (bottom row, three pooled donors). Orange bars are germline segment-specific CDR H3 length distributions of unique clonotypes. Blue bars are overall CDR H3 length distributions of unique clonotypes. Red and blue horizontal lines above the distributions indicate ranges of CDR H3 lengths that differ significantly between germline segment-specific and overall length distributions in a two-tailed paired t-test (P < 10⁻⁴) with a sliding window of two contiguous CDR H3 lengths, with red and blue indicating relative enrichment and depletion, respectively, in the germline segment-specific distributions. Distributions were determined for each donor individually followed by averaging across donors, except for the nonproductive sequences in which donors were combined prior to calculation of distributions. Error bars indicate S.E.M. The full set of distributions is shown in Supplementary Fig. 2.

To further discern the CDR H3 length biases quantitatively, we performed a principal component (PC) analysis of the length distributions (lengths 5–26) associated with different VH germline segments. Results from the PC analysis were visualized by projecting each germline segment onto the most important trends to obtain the so-called PC scores, aided by a visual analysis of the corresponding distributions (Fig. 3a). PC1 and PC2 corresponded to apparent skewness and kurtosis of the distributions, respectively. Using the PC analysis results in conjunction with visual inspection of VH germline segment-associated CDR H3 distributions in the AE compartment, germline segments were categorized by bias type as “Short”, “Neutral” and “Long” (Figs. 2 and 3a, Supplementary Fig. 2). Germline segments with length distributions similar to the overall distribution were called “Neutral” (located around the center of the PC plot), while those with shifts towards longer or shorter lengths were called “Long” (left-skewed, with low values of PC1) and “Short” (right-skewed, with high values of PC1), respectively. Within each of these classes, some germline segments also showed varying degrees of kurtosis relative to the overall distribution (extreme values in PC2). Differences between the distributions of members of different groups can be subtle, both visually and in the PC analysis. The magnitude of the biases and the details of distribution shapes varied for different VH germline segments within each group but were consistent across datasets for each germline segment (Supplementary Fig. 3). Germline segments in the same VH subfamily did not always have the same biases. The range of germline segment prevalence in the various datasets was similar for the different bias groups (Supplementary Fig. 4a).
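Because PC1 and PC2 are reported to track apparent skewness and kurtosis, those two shape descriptors can be computed directly from a length distribution as a quick cross-check. The Python sketch below is not the PC analysis itself; it computes moment-based skewness and excess kurtosis from a {length: frequency} mapping, and the two toy distributions are hypothetical.

```python
def distribution_moments(freqs):
    """Return (mean, skewness, excess kurtosis) of a {length: frequency} mapping.

    Frequencies are renormalized to sum to 1. Positive skewness means the tail
    of the distribution extends towards longer lengths; excess kurtosis is
    measured relative to a Gaussian (which has excess kurtosis 0).
    """
    total = sum(freqs.values())
    probs = {length: f / total for length, f in freqs.items()}
    mu = sum(length * p for length, p in probs.items())
    variance = sum(p * (length - mu) ** 2 for length, p in probs.items())
    third = sum(p * (length - mu) ** 3 for length, p in probs.items())
    fourth = sum(p * (length - mu) ** 4 for length, p in probs.items())
    skewness = third / variance ** 1.5
    excess_kurtosis = fourth / variance ** 2 - 3
    return mu, skewness, excess_kurtosis

# Hypothetical toy distributions.
symmetric = {10: 0.25, 11: 0.50, 12: 0.25}
short_biased = {10: 0.60, 11: 0.30, 12: 0.10}  # mass at short lengths, tail to long

sym_mean, sym_skew, sym_kurt = distribution_moments(symmetric)
sb_mean, sb_skew, sb_kurt = distribution_moments(short_biased)
```

Moment-based descriptors summarize the whole distribution in two numbers, so they will miss localized deviations in discrete ranges of the length spectrum that a PC analysis of the full frequency vectors, or visual inspection, can pick up.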

Fig. 3: Differentiation of CDR H3 length distributions by PC analysis.

PC analyses of VH (a) and VL (b) germline segment-associated CDR H3 length distributions in the AE compartments are shown. Analysis in a excludes the WA dataset due to limited germline segment coverage. Analysis in b includes both CA and TX datasets. Each data point indicates a germline segment for which minimum count requirements were met for the analysis. Points are color-coded by bias group determined by visual inspection of CDR H3 length distributions (Supplementary Fig. 2 and 6). Axes are oriented to position distributions skewed towards long lengths and with high apparent kurtosis to the right and top respectively.

We determined whether the observed distribution biases were also present in the naive B cell subset. The biases of the Long group were also observed in the naive B cell compartment, without apparent differences relative to the distributions in the AE compartments (Fig. 2, Supplementary Fig. 2). All the germline segments in the Neutral group showed average CDR H3 length distribution in the naive subset as well. However, distribution biases of the Short group in the naive compartment were less prominent (Fig. 2 and Supplementary Fig. 2), consistent with the average CDR H3 length analysis (Fig. 1b and c). Short biases in the naive compartment were mostly limited to the VH3-73 and VH3-15 germline segments in the TX and WA datasets. Despite the differences in overall CDR H3 length between the TX and WA naive datasets, the biases in the naive compartment had the same trends in both datasets (Fig. 2 and Supplementary Fig. 2).

The data analysis was performed with datasets filtered for sequences likely to belong to the same lineage. To confirm that biases are not due to pockets of clonal expansion, we performed a repertoire similarity index (RSI) analysis with the CA, TX and MA datasets similar to a recently described method27 (see Methods for more details). Overall, no apparent increase in RSI scores indicating clonal expansion was associated with regions of positive prevalence biases in different parts of the CDR H3 length spectrum for the different bias groups (Supplementary Fig. 5a), confirming that clonal expansion does not account for the observed CDR H3 length biases.

CDR H3 length distribution biases are not generated by recombination

We next determined whether the biases observed in the naive compartment are a direct consequence of biases in the VDJ recombination process for each germline segment. For this, we analyzed frameshifted, nonproductive VH sequences that were part of the naive WA dataset. Nonproductive recombination products are not directly subject to selection and therefore provide information about recombination products prior to any repertoire selection. As previously reported16, the CDR H3 lengths of nonproductive VH genes are longer than the productively recombined genes in mature B cell subsets (Fig. 2 and Supplementary Fig. 2). However, CDR H3 length for the nonproductive sequences associated with different VH germline segments approximated a Gaussian distribution, with no observable biases associated with different VH germline segments relative to the overall dataset, except for minor anomalies associated with some germline segments (Fig. 2 and Supplementary Fig. 2). Therefore, heavy chain recombination mechanisms do not account for the naive repertoire CDR H3 length distribution biases.

CDR H3 length distribution varies with VL germline segment use

We performed a similar analysis of CDR H3 length distribution as a function of VL germline segment and B cell compartment using PC and visual analysis. Similar to the VH germline segment-associated biases, VL-associated biases in the AE compartment could be classified into three groups based on the skewness of their distributions, named here “Short” (right-skewed, with high value of PC1), “Long” (left-skewed, with low value of PC1) and “Neutral” (intermediate values of PC1), present in both the CA and TX datasets, each group including a diverse set of Vκ and Vλ germline segments (Figs. 3b and 4, top row, and Supplementary Fig. 6). PC1 and PC2 for the light chain were also associated with apparent skewness and kurtosis. The VL Long bias group has Gaussian CDR H3 length distributions, whereas the VL Short bias group includes distribution shapes with marked deviations from Gaussian, including localized frequency spikes in discrete sections in the short range. Only Vκ germline segments in the Long group were associated with similar CDR H3 length biases in the TX naive compartment (Fig. 4, Supplementary Fig. 6). The magnitude of the VL-associated biases varied for different germline segments within each bias group but was consistent between datasets (Supplementary Fig. 7). As above, the RSI analysis results indicated that clonal expansion does not account for the VL germline segment-associated CDR H3 length biases (Supplementary Fig. 5b). The prevalence of Short group germline segments in the dataset was lower than for germline segments of the other two groups (Supplementary Fig. 4b).

Fig. 4: CDR H3 length distribution groups associated with VL germline segments.

Characteristic examples of each group are shown, averaged for all CA and TX donors in the AE compartments and the naive compartment of TX donors. Red and blue horizontal lines above the distributions indicate ranges of CDR H3 lengths that differ significantly between germline segment-specific and overall length distributions in a two-tailed paired (within donor, n = 5 donors) t-test (P < 10⁻²) with a sliding window of two contiguous CDR H3 lengths, with red and blue indicating relative enrichment and depletion, respectively, in the germline segment-specific distributions. Number of donors is 5 for TX AE and CA AE combined and 3 for TX naive except for Vλ7-46 in TX naive, where n = 1. Distributions were determined for each donor individually followed by averaging across donors. Error bars indicate S.E.M. The full set of distributions is shown in Supplementary Fig. 6.

CDR H3 length is biased as a function of VH/JH combination

JH germline segments vary in the number of amino acid residues that can be potentially contributed to CDR H3, from 4 in JH4 to 9 in JH6. We assessed whether differential JH germline segment usage as a function of VH and VL germline segment use is the basis for V segment-associated CDR H3 length biases. No clear deviations from average JH usage were observed in association with most VH germline segments in the WA nonproductive sequences (Supplementary Fig. 8a and 9, top panel). The observed deviations in JH prevalence do not readily explain CDR H3 distribution biases associated with VH and VL germline segments (Supplementary Fig. 8b, c and d, and Supplementary Fig. 9, bottom panel) with the exception of VL germline segment Vκ2-28 in the Long CDR H3 bias group, which was associated with a higher prevalence of the longer JH6 and lower prevalence of the shorter JH4 germline segments in the heavy chain (Supplementary Fig. 8d).

We next analyzed CDR H3 length distributions associated with different VH/JH germline segment combinations, comparing these to the CDR H3 length distribution of all sequences with the corresponding JH germline segment. As expected, CDR H3 length distributions were generally shifted according to length of the JH segment in the germline regardless of VH germline segment (Fig. 5, Supplementary Fig. 10 and 11). However, a subset of VH-associated CDR H3 length biases were impacted by JH germline segment in a manner independent of length of the JH segment in the germline, with very similar patterns in the naive WA and naive SRI subsets (Fig. 5, Supplementary Fig. 10 and 11). These included a short CDR H3 length bias associated with sequences with the VH3-72, VH3-73, and VH3-15 germline segments combined with the JH5 and/or JH4 germline segments (Fig. 5, Supplementary Fig. 10 and 11). Additional CDR H3 length biases were observed for other VH/JH germline segment combinations (Supplementary Fig. 10). Our results indicate that CDR H3 length distribution biases are not necessarily uniform for each VH germline segment but may vary in association with JH germline segment. In addition, the effect of JH on CDR H3 length distribution is not necessarily similar within VH bias groups, indicating some degree of heterogeneity within bias groups.

Fig. 5: Modulation of VH-associated CDR H3 length biases by JH germline segments in the WA naive compartment.

CDR H3 length distributions associated with individual VH/JH combinations (orange bars) are compared to overall repertoire distributions associated with JH germline segments (blue bars). Arrows highlight biases dependent on JH segment use. Dotted lines indicate the position of peaks in the distributions associated with VH3-9 and all JH germline segments in the same part of the length spectrum. The number of CDR H3 amino acid (aa) residues potentially encoded by the JH3 to JH6 germline segments is indicated in row legends. Distributions were determined for each donor individually followed by averaging across donors. Error bars indicate S.E.M. (n = 3). The full set of distributions is shown in Supplementary Fig. 10. The figure panels are ordered vertically by the number of residues the JH segments can contribute to CDR H3.

Differentially trimmed JH segments in the naive compartment

The CDR H3 length distribution biases associated with a subset of VH/JH germline segment combinations may be a consequence of biases in JH trimming as a function of VH germline segment. JH residue occupancy in the last CDR H3 positions of JH4 and JH5 sequences was used to indirectly determine JH trimming. The JH1, 2, 3, and 6 germline segments were not analyzed due to lack of sufficient data or, in the case of JH6, limited CDR H3 length biases associated with it. No apparent biases in JH residue occupancy relative to the overall dataset were observed for any of the analyzed VH/JH combinations in the nonproductive WA sequences (Supplementary Fig. 12). However, JH residue trimming biases were observed for different VH/JH combinations in the WA naive compartment (Fig. 6 and Supplementary Fig. 12). General trends in residue occupancy in JH4 were similar in SRI naive sequences for the VH/JH4 germline segment combinations with sufficient numbers for analysis (Supplementary Fig. 13). Residue-specific trimming biases were mostly coordinated for consecutive JH residues in each analyzed VH/JH combination, as expected due to the directional nature of trimming. However, closely related VH germline segments can be associated with distinct trimming biases of different JH4 residues. For instance, VH2-5/JH4 sequences are associated mostly with reduced trimming of IMGT® residue 115 (Phe) whereas for VH2-70/JH4 strongly reduced trimming of residue 116 (Asp) was also observed (Fig. 6 and Supplementary Fig. 12). The results indicate a complex set of constraints leading to selection of differentially trimmed JH segments in the context of certain VH and JH germline segments during naive repertoire maturation.
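The JH residue occupancy measure can be illustrated with a short Python sketch for JH4, whose residues “YFDY” can occupy IMGT® positions 114–117. The sketch assumes CDR H3 strings that span positions 105–117 (without Cys-104 and Trp-118) and assumes that, under the IMGT® gap convention, position 114 is only occupied in CDR H3 of at least 8 residues; both assumptions, the function name and the toy sequences are illustrative and not taken from Methods.

```python
JH4_CDR_RESIDUES = "YFDY"         # JH4 residues that can occupy IMGT 114-117
POSITIONS = (114, 115, 116, 117)  # IMGT positions of the last four CDR H3 residues
MIN_LENGTH = 8                    # assumed minimum length with position 114 occupied

def jh4_occupancy(cdrh3_seqs):
    """Fraction of sequences carrying the germline JH4 residue at each position.

    A match is only indirect evidence that the residue was not trimmed:
    N-additions or somatic mutation can create or remove a match.
    Sequences shorter than MIN_LENGTH are skipped.
    """
    usable = [s for s in cdrh3_seqs if len(s) >= MIN_LENGTH]
    if not usable:
        return {pos: 0.0 for pos in POSITIONS}
    counts = [0] * len(POSITIONS)
    for seq in usable:
        tail = seq[-len(POSITIONS):]
        for i, residue in enumerate(tail):
            if residue == JH4_CDR_RESIDUES[i]:
                counts[i] += 1
    return {pos: c / len(usable) for pos, c in zip(POSITIONS, counts)}

# Hypothetical CDR H3 sequences paired with JH4.
toy_cdrh3 = ["ARDGYFDY", "ARGGAFDY", "ARGGGGDY", "ARWSGLDV"]
occupancy = jh4_occupancy(toy_cdrh3)
```

In the toy input, occupancy rises from position 114 to 116, the pattern expected when trimming proceeds directionally from the 5′ end of the JH segment. Stratifying the input by VH germline segment and CDR H3 length before calling the function would give the kind of per-combination comparison against the overall JH4 repertoire described above.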

Fig. 6: Biases in JH4 germline segment residue occupancy in CDR H3 associated with different VH germline segments and CDR H3 length in WA naive sequences.

Occupancy of JH germline segment-encoded residues in the last CDR H3 positions (IMGT® 114–117) is shown. JH4 residues are color-coded by position as indicated in the VH3-23 panel by “YFDY”, the four JH4 residues that can be included in CDR H3 in positions 114–117. Solid lines indicate average residue occupancy for all sequences in the naive repertoire with the JH4 germline segment. Dots indicate average residue occupancy with each VH germline segment and CDR H3 length. Bars indicate S.E.M. (n = 3 donors). Data points with fewer than 80 sequences were excluded.

Discussion

Understanding antibody CDR H3 diversity generation, a process critical for the availability of immune receptors binding a wide range of antigens, has long been a goal in the immunology and antibody engineering fields. Numerous reports have described overall CDR H3 length and amino acid composition biases in health and disease and in different B cell developmental stages6,7,8,9,10,11,12,13,14,15,16,17,18,19,20. CDR H3 and junctional segments lengths in the B cell repertoire have been assumed to be independent of VH and JH germline segments except for their lengths prior to recombination17,19,20. This assumption has been implicitly used, for example, to generate simulated CDR H3 datasets to estimate the significance of observed clonal convergence in repertoires20. Analysis of the impact of VL on CDR H3 length has been limited to CDR L3 length, with negative results23. Here we describe detailed, high-dimensional analyses of CDR H3 and junctional segment length distributions and show a complex set of biases determined by VH, VL, and JH germline segment use and B cell maturation state that are not predictable from germline segment lengths and amino acid sequences. Most of the length and junctional biases we describe are evident in the naive B cell compartment but not in the nonproductive subset, indicating a major role of naive B cell repertoire maturation and, by extension, selection against self-reactivity or for structural integrity as likely factors in shaping those biases. In addition, only a subset of VH or VL germline segments is associated with biases towards shorter CDR H3 lengths in the antigen-experienced compartment, indicating general germline segment-specific adaptive immunity selection processes shared among individuals. 
Similar T cell receptor β chain CDR3 length distribution biases with different TRBV germline segments in repertoires arise in the process of T cell maturation28, although selective processes may differ between B and T cell repertoires due to differences in mechanisms of antigen recognition. Our results provide a detailed view into the dynamics of heavy chain junctional biases in antibody repertoires that complements previously described dynamics of clonal selection and expansion, convergence, sequence diversity and changes in overall CDR H3 properties in B cell maturation9,10,11,20,22,25,27,29.

Special consideration was given to the repeatability and robustness of the findings. The results are based on a total of 12 donors in four datasets and confirmed by analysis of 8 additional donors from the SRI dataset. These datasets were obtained and parsed with different sequencing methods and bioinformatic pipelines, minimizing the impact of technical artifacts. Some of the biases, such as those associated with VL and JH, cannot be easily generated by sequencing or parsing artifacts, especially in a systematic fashion across datasets. The stringency of clonotype clustering criteria had limited impact on results. This is exemplified by the WA and SRI datasets, which yielded CDR H3 length distributions similar to other datasets (Supplementary Fig. 3) despite having been clustered by clonotype using a higher CDR H3 sequence identity threshold than other datasets (Supplementary Table 1).

A recurring theme in the results presented here was that biases observed at one level (e.g., VH germline segment) were only partly explained by biases at higher-dimensional levels (e.g., VH/JH combinations), with additional unexpected biases observed in the higher-dimensional levels. It is expected that higher-dimensional analyses including other repertoire descriptors will uncover additional biases, two examples being DH junctional length biases (Supplementary Fig. 14, 15, and 16 and Supplementary text) and CDR H3 length biases associated with different VH germline segment allelic variants (Supplementary Fig. 17). Haplotype variations could also potentially affect CDR H3 length distributions in a VH allele-dependent manner or through differences in DH germline segment composition and differential recombination frequencies of DH or JH germline segments of different lengths in different chromosomes, combined with differential recombination frequencies of VH alleles30,31,32. However, the observation of essentially the same CDR H3 length distribution biases in several donors from 5 different sources and junctional segment length biases in 11 donors from 2 of these sources, along with a lack of systematic associations between VH, DH, and JH alleles across donors31,32, indicates that haplotype variations are unlikely to be a major factor in the CDR H3 and junctional length distribution biases described here. In addition, heavy chain variable region haplotype differences would not be expected to impact CDR H3 distributions associated with VL germline segments and the AE compartment-specific short CDR H3 length biases.

The analyses shown here use germline segment information as a proxy for undefined sequence features that ultimately determine the observed biases. The selected CDR H3 sequence and structural properties that result in the observed biases and the germline segment sequence properties that determine those biases remain to be identified. Analysis of VH germline segment residues that can directly encode or bias the first CDR H3 residues in IMGT® positions 105–107 did not reveal clear correlations between these and most CDR H3 bias groups or junctional segment length biases (Supplementary Fig. 18). In addition, no obvious correlations between JH trimming biases and variations in VH germline segment residues in positions 40–42 generally contacting the differentially trimmed JH residues 115 and 116 were observed. The differentially trimmed residue 116 is located in a region at the base of CDR H3 that can adopt either a “bulged” or “extended” conformation33,34. The bulged conformation appears to depend on the Ig domain, encoded mostly by VH germline segments34,35. Whether VH germline segment-dependent JH trimming biases reflect biases in the structure of the CDR H3 base remains to be determined. One challenge in determining how different germline segment regions determine the observed biases is the relatively limited number of non-redundant human antibody structures with different VH/JH combinations or VL germline segments with different CDR H3 lengths.

The CDR H3 biases described here pose questions about the functional properties that might shape those biases and the functional consequences of these biases for adaptive immunity. The emergence of some biases in the naive repertoire suggests selection against self-reactivity, selection for structural integrity, expression or a combination of these factors as possible mechanisms. If related to selection against self-reactivity, the different biases indicate either that features other than CDR H3 charge and hydrophobicity contribute to self-reactivity or that V germline segments outside CDR H3 modulate the self-reactivity mediated by these factors. The possible role of selection against self-reactivity may allow leveraging the biases observed in CDR H3 in large natural human antibody repertoire deep sequencing datasets to understand CDR H3 properties beyond charge that correlate with antibody polyspecificity and self-reactivity, a subject of considerable interest in therapeutic antibody development and in the understanding of functional consequences of immune disorders with altered CDR H3 sequence profiles11,36,37. The bias towards shorter CDR H3 lengths associated with a subset of VH and VL germline segments in the AE compartment may be attributable to these same mechanisms or to antigen-driven immune selection. The latter would suggest widespread convergences in human repertoires associated with certain VH and VL germline segments or, possibly, some degree of functional specialization in the germline segment repertoire linked to short CDR H3 sequences, analogous to the association between CDR H3 length and recognition of different antigen classes38.

Our results point to unexpected cross-constraints between VH, VL, JH, and other junctional elements selected at different stages of B cell development that shape CDR H3 and junctional length distributions in antibody repertoires. That is, CDR H3 length distribution in the repertoire is not independent of heavy and light chain VH, VL, and VH/JH germline segment usage as implicitly assumed6,7,8,9,10,11,17,18,19,20. Instead, overall CDR H3 and junctional length distributions in antibody repertoires are aggregates of several sub-repertoires with discrete sets of biases relative to each other that arise in different stages of B cell maturation as a function of VH, VL, and JH germline segment use, shared by normal donors. Thus, proper description of CDR H3 length biases in disease and immune states11,12,39 requires the context of VH, VL, or VH/JH germline segment usage in which these biases are observed for meaningful interpretation. The analyses described here provide a high-dimensional CDR H3 analytical framework, in which CDR H3 and junctional length distributions are analyzed in the context of VH, VL, and JH germline segments and combinations of these, and a baseline of these biases with multiple healthy donors for further studies of B cell repertoire maturation and clonal selection in health and disease.

Methods

Datasets and analysis

Sequences were obtained from the original publications14,21,25 except for the MA dataset. The sequences in the MA dataset were obtained from a re-sequencing by Illumina MiSeq of a set of previously described samples22, deposited in the Sequence Read Archive (SRA) database24. A summary of the samples used here is given in Supplementary Data 5. Sequencing methods for the MA dataset are described in the experiment design section associated with sample data (see https://www.ncbi.nlm.nih.gov/sra/SRX2251687). The SRI dataset donors included in the analyses are 316188, 326650, 326737, 326780, 326797, 326907, 327059, and D10320. SRI dataset donors 326713 and 326651 were only used for VH germline segment allele-specific analyses. Sequences were used as parsed in the original publications except for sequences of the MA dataset obtained from the SRA database24, where the raw sequencing files were processed and germline segments annotated with a custom pipeline (available from the Docker Hub repository at https://hub.docker.com/r/kamhonhoi/iganalysis). Briefly, paired-end reads were merged using FLASH40 to reconstruct the full-length variable domain sequences using the following parameters: read length, 300 bp; expected fragment length, 530 bp; standard deviation, 50 bp. The full-length sequences were subsequently processed to identify the frameworks and CDR regions using position-weighted motifs as previously described41. IgBlast42 was used to supplement the region-parsed data with germline segment annotation for each sequence, including nucleotide somatic mutations. Isotypes were determined by finding the closest matching human CH1 isotype based on the available CH1 sequence. Each sequence was processed and annotated with the frameworks, CDRs, germline segment use and clonotype grouping (see below).
Nonproductive sequences in the WA dataset used for analyses were limited to frameshifted sequences in the naive compartment to minimize the indirect effects of sequencing errors and clonal expansion. The CDR H3 length of nonproductive, frameshifted sequences in amino acid residues was set to the nearest integer of the CDR H3 length in nucleotides divided by 3. For naive compartment sequences of Donor 1 of the WA dataset only the D1a repeat was used for most analyses25. All CDR H3 length distributions and germline segment prevalence analyses were determined using custom scripts and Microsoft Excel 2016. The IMGT® CDR definition and numbering system are used throughout26.
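
The length assignment for frameshifted sequences can be illustrated with a minimal sketch. The function name `cdr_h3_length_aa` is hypothetical and is not taken from the custom scripts used in this study; it only shows the nearest-integer rule stated above.

```python
def cdr_h3_length_aa(cdr_h3_nt_length: int) -> int:
    """Return the CDR H3 length in residues as the nearest integer of nt / 3.

    Hypothetical helper for illustration only. Adding 1 before floor
    division by 3 rounds to the nearest integer; a remainder of exactly
    one half cannot occur when dividing an integer by 3.
    """
    if cdr_h3_nt_length < 0:
        raise ValueError("nucleotide length must be non-negative")
    return (cdr_h3_nt_length + 1) // 3


# A frameshifted CDR H3 of 43 nt (14.33 residues) is assigned length 14,
# and one of 44 nt (14.67 residues) is assigned length 15.
example_lengths = {nt: cdr_h3_length_aa(nt) for nt in (43, 44, 45)}
```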

Clonotype clustering

Clonotypes in the CA and TX datasets were defined as sequences from the same donor, VH and VL germline segments and CDR H3 length with a nominal 57% or greater CDR H3 amino acid sequence identity, which better approximates an average 60% CDR H3 sequence identity across the range of CDR H3 lengths. IgG/IgA and IgM sequences were segregated prior to clonotype clustering. For the TX dataset IgG/IgA, CD27pos/IgM and CD27neg sequences were segregated prior to clonotype clustering. The 11% of sequences without isotype information in the TX AE datasets, which included IgG, IgA, and IgM sequences, were excluded from the analyses due to the differences between IgG/IgA and IgM overall distributions in the AE compartment (Fig. 1b). Clonotypes in the MA dataset were defined as sequences from the same donor, VH and JH germline segments and CDR H3 length with a nominal 57% (average 60% identity across CDR H3 lengths) or greater CDR H3 amino acid sequence identity as above. Clonotypes in the SRI dataset were defined as sequences from the same donor with the same VH and JH germline segments, isotype and same CDR H3 length and sequence. Only sequences labeled as “productive” in the SRI dataset were analyzed. Only one randomly chosen sequence from each clonotype was retained in the dataset for the TX, CA, WA, MA, and SRI datasets. Clonotypes in the WA dataset were defined as sequences from the same donor with the same VH and JH germline segments and same CDR H3 length and sequence. If VH germline segment information was not available then VH subfamily information was used instead, retaining as a representative for the clonotype a sequence with VH germline segment information if available. If JH germline segment information was not available then this parameter was ignored, also retaining otherwise identical sequences with available JH information as representatives for clonotypes, if available. Nonproductive sequences in the WA dataset were not processed for clonotype clustering.
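
The clonotype definition for the paired CA and TX datasets can be sketched as follows. This is a hedged illustration and not the pipeline used in this study: the record keys (`donor`, `vh`, `vl`, `cdr_h3`), the function names and the use of single-linkage grouping are all assumptions, since the linkage rule is not specified in the text. Only the grouping keys, the 57% nominal identity threshold and the retention of one random sequence per clonotype come from the description above.

```python
import random
from collections import defaultdict


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length CDR H3s."""
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_clonotypes(records, min_identity=0.57):
    """Group record indices into clonotypes.

    Assumption: single-linkage grouping within each
    (donor, VH, VL, CDR H3 length) bin.
    """
    bins = defaultdict(list)
    for idx, rec in enumerate(records):
        key = (rec["donor"], rec["vh"], rec["vl"], len(rec["cdr_h3"]))
        bins[key].append(idx)

    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for members in bins.values():
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                same = identity(records[i]["cdr_h3"], records[j]["cdr_h3"])
                if same >= min_identity:
                    parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return list(clusters.values())


def pick_representatives(records, clusters, seed=0):
    """Retain one randomly chosen sequence per clonotype."""
    rng = random.Random(seed)
    return [records[rng.choice(cluster)] for cluster in clusters]


# Toy records (invented sequences, for illustration only).
records = [
    {"donor": "D1", "vh": "IGHV1-69", "vl": "IGKV1-39", "cdr_h3": "ARDGYSSGWY"},
    {"donor": "D1", "vh": "IGHV1-69", "vl": "IGKV1-39", "cdr_h3": "ARDGYSSGFY"},
    {"donor": "D1", "vh": "IGHV1-69", "vl": "IGKV1-39", "cdr_h3": "ARWLLPQNTV"},
    {"donor": "D2", "vh": "IGHV1-69", "vl": "IGKV1-39", "cdr_h3": "ARDGYSSGWY"},
]
clusters = cluster_clonotypes(records)
representatives = pick_representatives(records, clusters)
```

In this toy example the first two sequences share 9 of 10 residues and fall into one clonotype, the third shares only 2 of 10 residues with either and stays separate, and the fourth is kept apart because it comes from a different donor. For the MA dataset the same sketch would apply with the JH germline segment in place of VL in the grouping key.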

Repertoire similarity index analysis

RSI was computed in a manner similar to a previously described method27. For a given set S of CDR H3 sequences, all of the same length n, RSI is measured as follows:

$$RSI = \left[ {1 - \frac{{{\mathrm{median}}\left\{ {LD\left( {S_i,S_j} \right)_{i \ne j}} \right\}}}{n}} \right] \times 100\% \,\forall \,1 \le i,j \le |{\boldsymbol{S}}|$$

where Si and Sj refer to any two sequences in the set of CDR H3 sequences and LD(Si,Sj) refers to the Levenshtein distance function, which measures the number of amino acid changes necessary to convert Si to Sj. For a given VH germline segment and CDR H3 length, RSI values were computed for those sequences that shared the same VL germline segment (for the paired CA and TX datasets) or the same JH germline segment (for the unpaired MA datasets) and the same CDR H3 length. Values were computed separately for each donor in the datasets and averaged for each length. Values shown in graphs in Supplementary Fig. 5 are the averages in each length for different datasets. All calculations were performed using custom scripts in R.
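
The RSI formula can be illustrated with a short sketch. The calculations in this study were performed in R; the Python below is only an illustrative re-statement of the formula, and the function names `levenshtein` and `rsi` are hypothetical. The median is taken over unordered pairs, which gives the same value as the median over all ordered pairs with i ≠ j because the distance is symmetric.

```python
from itertools import combinations
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Edit distance (substitutions, insertions, deletions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def rsi(sequences) -> float:
    """Repertoire similarity index (%) for CDR H3 sequences of one length n."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    n = len(sequences[0])
    if any(len(s) != n for s in sequences):
        raise ValueError("all sequences must have the same length")
    distances = [levenshtein(a, b) for a, b in combinations(sequences, 2)]
    return (1 - median(distances) / n) * 100


# Toy set of three 6-residue sequences (invented for illustration):
# pairwise distances are 1, 1 and 2, so the median is 1 and RSI = (1 - 1/6) x 100.
toy_rsi = rsi(["ARDGYW", "ARDGFW", "ARSGYW"])
```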

Principal component analysis of CDR H3 length distributions

The length distribution of each germline segment was captured as a vector of length 22 containing the percentage of sequences of length 5–26. For VH, the values for each germline segment were averaged over all the AE datasets except the WA dataset due to limited germline segment coverage. For VL, the values were averaged over the CA and TX datasets. The distributions of each germline segment were consolidated into a matrix X of dimensions n × 22, where n is the number of germline segments considered for analysis (n = 39 for VH and n = 35 for VL). The variance-covariance matrix S (22 × 22) of X was computed with elements Sij as

$$S_{ij} = \left\langle {\left( {X_i - \left\langle {X_i} \right\rangle } \right)\left( {X_j - \left\langle {X_j} \right\rangle } \right)} \right\rangle \,\forall \,1 \le i,j \le 22$$

where <> refers to the average across all germline segments and Xi refers to the column of X for the i-th CDR H3 length. Eigen decomposition of the matrix S results in 22 eigenvectors, each of which captures a trend in the distribution as a function of the CDR H3 lengths; the eigenvectors are sorted in decreasing order of the variance they capture. Each germline segment was then projected onto these eigenvectors to obtain the PC scores, which enabled visualization of the different trends and comparisons among the different germline segments. Principal component analysis was performed using the ‘princomp’ function, and the Repertoire Similarity Index was implemented using the ‘sdists’ function in the package ‘cba’, both in R version 3.5.0.
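
The procedure described above (covariance matrix, eigen decomposition, projection) can be sketched with NumPy. This is an illustrative re-statement and not the R code used in this study; the function name `pca_length_distributions` and the randomly generated toy matrix are assumptions. The covariance uses divisor n, which is the default behavior of R's ‘princomp’.

```python
import numpy as np


def pca_length_distributions(X):
    """PCA of an n x 22 matrix of CDR H3 length distributions (in %).

    Returns eigenvalues (descending), eigenvectors (as columns) and the
    PC scores of each germline segment (row) on those eigenvectors.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)              # subtract <X_i> per length
    S = centered.T @ centered / X.shape[0]     # 22 x 22 covariance, divisor n
    eigvals, eigvecs = np.linalg.eigh(S)       # S is symmetric
    order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs                # project each germline segment
    return eigvals, eigvecs, scores


# Toy data only: 39 random distributions over 22 length bins, each summing to 100%.
rng = np.random.default_rng(0)
toy_X = rng.dirichlet(np.ones(22), size=39) * 100
eigvals, eigvecs, scores = pca_length_distributions(toy_X)
```

Plotting the first two columns of `scores` against each other would give the kind of germline segment comparison described in the text. Because each row sums to 100%, the smallest eigenvalue of such a matrix is expected to be close to zero.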

Statistics and reproducibility

Samples consist of independently sequenced antibody repertoires from 20 donors from five laboratories under different sequencing conditions, comprising five datasets with 2–8 donors each and varying numbers of sequences per donor (see Supplementary Data 1, 2, and 3 for details and sample sizes for each dataset, donor, B cell subset and germline segment). To avoid over-representation of sequences from donors and datasets with higher sequence counts, calculations were performed per donor and then averaged per dataset or across datasets except for Supplementary Fig. 1, where sequences from donors of each dataset were pooled prior to calculations. Germline segments within a donor with fewer than 80 counts were excluded from analyses. Reproducibility was assessed by comparing distributions between datasets, with the exception of nonproductive sequences, which were present only in the WA dataset. Two-tailed paired t-tests of CDR H3 length distributions were performed using Microsoft Excel for the Mac 2016. Data from individual donors comparing germline segment-specific to overall repertoire CDR H3 length prevalence from the CA, TX, MA, and WA datasets (n = 11 donors for VH, n = 5 donors for VL) were used for t-tests of AE compartment samples over a sliding window of two consecutive CDR H3 lengths to minimize local fluctuations. P-value thresholds of 10⁻⁴ (VH) and 10⁻² (VL) were used to assess significance across the range of CDR H3 lengths. Mann-Whitney tests for distributions in Supplementary Fig. 17 were done using GraphPad Prism version 6.
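
The per-donor averaging and the minimum-count filter can be illustrated with a minimal sketch. The function names and the toy counts are hypothetical, and the choice to express percentages relative to all sequences of the germline segment in a donor is an assumption; only the 80-count cutoff, the 5–26 length range and the per-donor-then-average order come from the text.

```python
from collections import Counter

LENGTHS = range(5, 27)   # CDR H3 lengths 5-26 (22 bins)
MIN_COUNT = 80           # germline segments below this count are excluded


def length_distribution(cdr_h3_lengths):
    """Percentage of sequences at each length, or None if below MIN_COUNT."""
    total = len(cdr_h3_lengths)
    if total < MIN_COUNT:
        return None
    counts = Counter(cdr_h3_lengths)
    return [100.0 * counts[length] / total for length in LENGTHS]


def donor_averaged_distribution(lengths_by_donor):
    """Compute the distribution per donor, then average across donors."""
    per_donor = [dist for dist in map(length_distribution,
                                      lengths_by_donor.values())
                 if dist is not None]
    if not per_donor:
        raise ValueError("no donor reaches the minimum count")
    return [sum(column) / len(per_donor) for column in zip(*per_donor)]


# Toy counts for one germline segment in three donors (invented numbers).
toy = {
    "donor_A": [10] * 50 + [12] * 50,   # 100 sequences
    "donor_B": [12] * 1000,             # 1000 sequences
    "donor_C": [20] * 10,               # 10 sequences, below the cutoff
}
averaged = donor_averaged_distribution(toy)
```

In this toy case length 12 has an averaged prevalence of 75%, whereas pooling all sequences before the calculation would give about 95% because donor_B contributes ten times more sequences than donor_A.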

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.