Heavy Chain CDR 3 and Junctional Length Biases in Human Antibody Repertoires Associated with Heavy and Light Chain Germline Utilization

Antibody variable domain sequence diversity is generated by recombination of germline segments. The third complementarity-determining region of the heavy chain (CDR H3) is the region of highest sequence diversity and is formed by the joining of heavy chain VH, DH and JH germline segments combined with random nucleotide trimming and additions between these segments. We show that CDR H3 length distribution is biased in human antibody repertoires as a function of VH, VL and JH germline segment utilization. Most length biases are apparent in the naïve B cell compartment, with a significant bias towards shorter CDR H3 sequences observed in association with a subset of VH and VL germlines in the antigen experienced compartment. Similar biases were not observed in nonproductive heavy chain recombination products, indicating selection of the repertoire during B cell maturation as a major driver of the length biases. Some VH-associated CDR H3 length biases are dependent on utilization of specific JH germline segments in a manner not directly linked to JH segment length in the germline, but are rather associated with selection of differentially trimmed JH segments in the naïve compartment. In addition, DH segment and N-region random nucleotide insertion lengths within CDR H3 in the naïve compartment were also biased by specific VH/JH germline combinations, indicating a complex set of constraints between germline segments selected during repertoire maturation. Our findings reveal biases in the antibody diversity landscape shaped by VH, VL, and JH germline features with implications for mechanisms of naïve and immune repertoire selection.


Introduction
The diversity of sequences in the variable regions of immunoglobulins is the basis for the ability of these molecules to bind a virtually unlimited number of antigenic structures. Sequence diversity in the primary repertoire is created by recombination of germline segments for both the heavy and light chains which results in the formation of full-length immunoglobulin variable region exons (1). The light chain variable region is created by the joining of VL and JL germlines while the VH region is created by recombination of VH, DH (or D) and JH germlines. The process of recombination starts with the heavy chain in progenitor B cells, initiated by D/JH recombination followed by VH/DJH recombination (2,3). Light chain recombination occurs in pre-B cells after successful completion of the heavy chain recombination. Germline segments in both chains are also trimmed and extended by a variable number of nucleotides by exonucleolytic nibbling of germline segments and random nucleotide incorporation in the N-regions flanking the D germline mediated by terminal deoxynucleotidyl transferase and germline palindromic duplications (3). B cell clones with full-length, in-frame variable regions are further selected to remove or induce receptor editing of self-reactive clones to form the naïve immune repertoire (4,5).
The third complementarity determining region (CDR) of the heavy chain (CDR H3) is the region of highest overall sequence and length diversity in antibody repertoires (1). CDR H3 length approximates a Gaussian distribution (6). Average CDR H3 length varies as a function of species, age, isotype, B cell development stage and disease state (6-13). CDR H3 amino acid composition is also biased in a CDR H3 length-dependent manner, associated with differential incorporation of D and JH germline sequences of different lengths and sequence composition into CDR H3 of different lengths (6). Beyond the germline-associated biases, CDR H3 has been shown to undergo different biases during B cell maturation. In particular, a bias towards shorter average CDR H3 lengths is observed in mature relative to immature B cells (9). This is accompanied by a reduction of positively charged residue content and hydrophobicity within CDR H3 associated with negative selection of self-reactive clones in the repertoire (9,11,14,15). A similar reduction in CDR H3 length occurs in isotype-switched memory B cells relative to naïve B cells (10,16).
The analyses of CDR H3 diversity and biases in health and disease have been mostly performed independently of the V regions contributed by VH and VL germlines (6-11, 17, 18).
Except for sequences that are directly incorporated into CDR H3, the impact of V germline segments on CDR H3 properties has not been addressed. In part this is due to the absence of any expectation for systematic CDR H3 biases as a function of VH germline, especially in the naïve B cell compartment prior to selection associated with adaptive immune responses. Analysis of the impact of the VL on CDR H3 has been limited to properties of the third CDR of the light chain, without any evidence of biases (16). Finally, analysis of the impact of JH germlines on CDR H3 biases has been confined to the expected effects of differential JH germline length and sequence composition. A recent analysis of a large dataset of isotype-switched human antibody sequences with paired chain information revealed an unexpected preferential pairing of IGHV3-7 (VH3-7) and Vk2-30 germlines (19) which, in subsequent detailed analysis, was determined to be linked to biases towards shorter CDR H3 lengths associated with both germlines. This prompted us to investigate the extent to which CDR H3 length might be biased as a function of germline use in human immunoglobulin repertoires. Here we describe high-dimensional analyses of CDR H3 sequences from several independently generated human antibody repertoire sequence datasets. Our results uncover biases in CDR H3 and junctional length distributions associated with VH, VL and JH germline segment utilization that shape naïve and immune repertoires in unexpected and unpredictable patterns.

Results
Sequence datasets. In the present study we analyzed sequences from four previously described B cell repertoire deep sequencing datasets including 3 donors each and a fifth dataset with 8 donors (16,19-22). We refer to these datasets as TX, WA, CA, MA and SRI. These datasets were sequenced and bioinformatically parsed using a diversity of methods (Table 1), minimizing the impact of methodological biases. For simplicity we refer to the TX CD27pos/IgG/IgA, CA, MA IgG/IgA and WA CD27pos subsets as "antigen-experienced" or "AE", the TX CD27pos/IgM as "AE IgM" and the TX and WA CD27neg subsets as "naïve". The TX and CA datasets include VH/VL chain pairing information. The SRI dataset was analyzed separately in this study to avoid overrepresentation of donors from a single source in pooled data. No antigen-specific selection of B cells was performed for any of the datasets, although the CA and MA datasets include both pre- and post-vaccination samples.
We aimed at identifying properties that are shared among donors and not influenced by clonal expansion related to specific immune responses. To minimize the impact of clonal expansion, the datasets were processed to retain a random sequence from each lineage, or clonotype. Clonotype definition varied according to sequencing and parsing method used for each dataset (Table S1). Nonproductive sequences were not grouped by clonotype. Overall distribution of CDR H3 lengths was not significantly affected by removal of redundant sequences in most datasets except for the WA and MA AE compartments, which had subtle but noticeable shifts (Fig.   S1A). Germline-specific analyses were performed with germlines having at least 240 counts for all donors in a dataset, which represent 94 to 98% of the repertoire in the CA, TX and MA datasets.
Due to ambiguities in VH germline calls in the WA dataset, germline-specific analyses in this dataset were performed with 16 VH germlines that had fewer than 10% ambiguous ("in ties") calls in the dataset, totaling about one third of the repertoire (Table S2). Analyses of whole repertoire data included all sequences filtered by clonotype in specific B cell subsets, regardless of germline classification. The overall AE CDR H3 length distributions are similar among datasets, allowing pooling data from different datasets for the AE B cell subset (Fig. S1B). However, the relative CDR H3 length distributions of the WA and TX naïve B cell subsets differed by an average of 0.9 residues (Fig. S1B) and were thus analyzed separately.
Average CDR H3 length varies with VH and VL germline use. As a first step we analyzed average CDR H3 length as a function of VH or VL germline use. Average CDR H3 length in the AE subset varied by up to 3 amino acid residues as a function of VH germline use and correlated well for different datasets, with relatively little deviation from length equivalence for different germlines between datasets (Fig. 1A). Average CDR H3 length also varied as a function of VL germline use by up to 4 amino acid residues in the AE compartment and correlated well between the CA and TX datasets (Fig. 1B). The naïve compartment showed a more limited spread in average CDR H3 lengths relative to the AE compartment ( Fig. 1C-F, blue squares). Significant reductions in average CDR H3 length in the AE relative to naïve compartments were associated with a subset of VH and VL germlines ( Fig. 1C-F). The TX AE IgM subset showed similar trends as the TX AE IgG/IgA subset except that average CDR H3 length was decreased in association with most VH germlines relative to the naïve compartment ( Fig. 1C and E).
CDR H3 length distribution varies as a function of VH germline use. We next determined whether CDR H3 length distribution varies with germline use. Overall CDR H3 length distribution of the respective B cell compartment was used as a relative standard to which germline-specific CDR H3 length distributions were compared. This was done to facilitate comparison of biases between samples and also because useful objective reference distributions are not available to determine bias types in naïve compartment sequences. Therefore, most biases described here, including all in the naïve compartment, are relative to the average of the repertoire in each B cell compartment. Overall and germline-specific CDR H3 length distributions were determined by averaging the frequency of each CDR H3 length for all donors across the TX, CA, MA and WA datasets, with the SRI dataset analyzed separately. Statistical analysis of biases was performed in the AE compartment by a two-tailed paired (by donor) t-test of length frequencies with a sliding window of two consecutive CDR H3 lengths to minimize the significance of local distribution fluctuations. Observed length distribution biases included overall shifts in average CDR H3 length for sequences with different VH germlines and also obvious and subtle deviations from the overall CDR H3 distribution in discrete ranges of the CDR H3 length spectrum ( Fig. 2A, Fig. S2A).
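The sliding-window comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (the paper reports using Microsoft Excel): the function names and the dictionary layout are hypothetical, the window is taken to be the summed frequency of two consecutive CDR H3 lengths, and only the paired t statistic is computed, leaving the two-tailed p-value to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def window_sums(freqs, lengths, width=2):
    """Sum the frequencies of `width` consecutive CDR H3 lengths.

    freqs maps CDR H3 length -> relative frequency for one donor.
    """
    out = {}
    for i in range(len(lengths) - width + 1):
        window = tuple(lengths[i:i + width])
        out[window] = sum(freqs.get(length, 0.0) for length in window)
    return out


def paired_t(a, b):
    """Paired t statistic; each pair (a[i], b[i]) comes from one donor."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        return float("nan")  # statistic undefined when all differences are equal
    return mean(diffs) / (sd / sqrt(len(diffs)))


def sliding_paired_t(germline_by_donor, overall_by_donor, lengths, width=2):
    """Compare germline-specific vs overall length frequencies per window.

    Both inputs map donor -> {length: frequency} (hypothetical layout).
    Returns {window: t statistic}; positive values mean enrichment in the
    germline-specific distribution relative to the whole repertoire.
    """
    donors = sorted(germline_by_donor)
    g = {d: window_sums(germline_by_donor[d], lengths, width) for d in donors}
    o = {d: window_sums(overall_by_donor[d], lengths, width) for d in donors}
    return {
        w: paired_t([g[d][w] for d in donors], [o[d][w] for d in donors])
        for w in g[donors[0]]
    }
```

Pairing by donor keeps donor-to-donor differences in baseline length out of the comparison, and summing adjacent lengths damps single-length fluctuations.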
To further discern the CDR H3 length biases quantitatively, we performed a principal component (PC) analysis of the length distributions (lengths 5 to 26) associated with different VH germlines to capture the trends causing highest variations in the observed length distributions.
Results from the PC analysis were visualized by projecting each germline onto the most important trends, to obtain the so-called PC scores (Fig. 3A). Interpretation of the PC score plots was aided by a visual analysis of the corresponding distributions. Overall, the most significant component, PC1, corresponded to apparent skewness towards shorter or longer lengths, whereas PC2 corresponded to apparent kurtosis, or relative enrichment or depletion of sequences in the midrange lengths, of the distributions. Using the PC analysis results in conjunction with visual inspection of VH germline-associated CDR H3 distributions compared to the overall CDR H3 distribution in the AE compartment, germlines were categorized by bias type in discrete groups as "Short", "Neutral", "Long", "Cut" and "Crested" (Fig. 2 and 3A, Fig. S2). These groups have different degrees of shifts towards longer or shorter lengths and kurtosis relative to the overall distribution. Differences between the distributions of members of different groups can be subtle, both visually and in the PC analysis. One example is the difference between the Neutral and Cut groups, the latter showing depletion of sequences limited to a narrow band of short CDR H3 lengths. The magnitude of the biases and the details of distribution shapes within each group varied for different VH germlines. However, these were consistent across datasets for each germline, VH1-69 being a notable exception (Fig. S3). Germlines in the same VH subfamily did not always belong to the same bias groups. The three major VH subfamilies, VH1, VH3 and VH4, are represented in the Neutral group. The range of germline prevalence in the repertoire was similar for different groups except for the higher prevalence of some germlines in the Crested group (Fig. S4).
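As an illustration of how PC scores of this kind can be obtained, the sketch below extracts the first principal component of a small table of length distributions (one row per germline, one column per CDR H3 length) by power iteration. It is a generic, dependency-free stand-in and not the software used in the study; the function name and input layout are assumptions.

```python
from math import sqrt


def first_pc(rows, iters=500):
    """Return (loadings, scores) for the first principal component.

    rows: list of equal-length frequency vectors, one per germline.
    """
    n, m = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - col_means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix between length bins (m x m).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
        for i in range(m)
    ]
    # Non-uniform start vector: rows that each sum to 1 give components
    # orthogonal to the all-ones vector, so a uniform start would stall.
    v = [1.0 / (j + 1) for j in range(m)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance in the data
        v = [x / norm for x in w]
    # PC scores: projection of each centered distribution onto the loadings.
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores
```

With a short-shifted, a symmetric and a long-shifted toy distribution as input, the first component separates the two shifted rows with scores of opposite sign and leaves the symmetric one near zero, which mirrors the reading of PC1 as a shift towards shorter or longer lengths.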
We determined whether the observed distribution biases were also present in the naïve B cell subset. The Long, Crested and Cut biases were also observed in the naïve B cell compartment, without apparent differences relative to the distributions in the AE compartments ( Fig. 2B and C and Fig. S2B, C, E and F). All the germlines in the Neutral group showed average CDR H3 length distribution in the naïve subset as well. However, distribution biases of the Short group in the naïve compartment were variable ( Fig. S2B and C), consistent with the average CDR H3 length analysis ( Fig. 1E and F). Short biases in the naïve compartment were mostly limited to the VH3-73 and VH3-15 germlines in the TX and WA datasets. Despite the differences in overall CDR H3 length between the TX and WA naïve datasets, the biases in the naïve compartment had the same trends in both datasets ( Fig. 2B and C and Fig. S2B and C).
The data analysis was performed in datasets aggressively filtered for sequences likely to belong to the same lineage. To confirm that biases are not due to pockets of clonal expansion, we performed a repertoire similarity index (RSI) analysis with the CA, TX and MA datasets similar to a recently described method (23). The RSI analysis computes CDR H3 identities within each donor for sequences with the same germline, CDR length and VH/VL pairing (for the paired CA and TX datasets) or VH and JH combination (for the unpaired MA dataset). Clonal expansion would thus be reflected by higher than average RSI values. Overall, no significant increase in RSI scores was associated with regions of positive prevalence biases in different parts of the CDR H3 length spectrum for different bias groups (Fig. S5A), confirming that clonal expansion does not account for the observed CDR H3 length biases.
CDR H3 length distribution bias as a function of VH germline is not generated by VDJ recombination. We next determined whether the biases observed in the naïve compartment are a direct consequence of biases in the VDJ recombination process for each germline. For this we analyzed frameshifted, nonproductive VH sequences that were part of the naïve WA dataset.
Nonproductive recombination products are not directly subject to selection and therefore provide information about recombination products prior to any selection of the repertoire. As previously reported, the CDR H3 length of nonproductive VH genes is significantly longer than the productively recombined genes in mature B cell subsets (15). However, CDR H3 length for the nonproductive sequences associated with different VH germlines approximated a Gaussian distribution, with no observable biases associated with different VH germlines relative to the overall repertoire except for minor anomalies associated with some germlines ( Fig. 2D and Fig.   S2D). Therefore, heavy chain recombination mechanisms do not account for the CDR length distribution biases observed in the naïve repertoire.
CDR H3 length distribution varies as a function of VL germline use. We performed a similar analysis of CDR H3 length distribution as a function of VL germline and B cell compartment using PC and visual analysis. Similar to the VH germline-associated biases, VL-associated biases in the AE compartment could be classified into three groups, named here "Short" (high value of PC1),  S6A). PC1 and PC2 for the light chain were also associated with apparent skewness and kurtosis.
The VL Long bias group has Gaussian CDR H3 length distributions, whereas the VL Short bias group includes distribution shapes with significant deviations from Gaussian, including localized frequency spikes in discrete sections in the short range. Only Vk germlines in the Long group were associated with similar CDR H3 length biases in the TX naïve compartment (Fig. 4B, Fig. S6B and C). The magnitude of the VL-associated biases varied for different germlines within each bias group but was consistent between datasets (Fig. S7). As above, the RSI analysis results indicated that clonal expansion does not account for the VL germline-associated CDR H3 length biases (Fig. S5B). The prevalence of germlines in the Short group in the repertoire was lower than for germlines of the other two groups (Fig. S4).
CDR H3 length is biased as a function of VH/JH combination. JH germlines vary in the number of amino acid residues that can be potentially contributed to CDR H3 from 4 in JH4 to 9 in JH6.
We assessed whether differential JH germline usage as a function of VH and VL germline is the basis for V segment-associated CDR H3 length biases. No significant deviations from average JH usage were observed in association with most VH germlines in the WA nonproductive sequences (Fig. S8A). Although deviations in JH prevalence linked to some germlines were observed in the naïve compartment of both datasets (e.g., VH2-5, VH3-9), those deviations do not readily explain CDR H3 distribution biases associated with these VH germlines (Fig. S8B, C and D). One exception was a single VL germline with a bias for longer CDR H3 lengths, Vk2-28, which was associated with a higher prevalence of the longer JH6 and lower prevalence of the shorter JH4 germline segments (Fig. S8D). Some of these biases were observed in a reciprocal analysis of JH germline usage as a function of VH germline and CDR H3 length in the naïve compartment but not in nonproductive sequences (Fig. S9). These include not only the generally skewed JH usage proportional to JH germline length, as expected, but also higher or lower than average prevalence of the JH5 germline and different JH4/JH6 germline usage ratios in the average to longer CDR H3 lengths.
We next analyzed CDR H3 length distributions associated with different VH/JH germline combinations, comparing these to CDR H3 length distribution of all sequences with the corresponding JH germline. As expected, CDR H3 length distributions were generally shifted according to the length of the JH segment in the germline regardless of VH germline (Fig. 5, Fig. S10 and S11). However, a subset of VH-associated CDR H3 length biases were impacted by JH germline in a manner independent of length of the JH segment in the germline, with very similar patterns in the naïve WA and IgM/naïve SRI subsets ( combined with all JH germlines except JH2 and JH6 (Fig. S10), the longest JH germlines. The Cut group was apparent only when the overall population distribution was shifted toward shorter CDR H3 lengths in association with the short JH4 germline, making the reductions in the number of sequences in the very short range in this group more apparent (Fig. 5). In contrast, the two VH germlines in the Long group that were analyzed, VH1-18 and VH2-26, were associated with long CDR H3 length biases in the context of most or all JH segments. Distributions associated with VH3-9 were unique in that the same peak of sequences with length 14 to 17 occurred independently of JH segment in both datasets. Our results indicate that CDR H3 length distribution biases are not necessarily uniform for each VH germline but may vary as a function of JH germline. In addition, the effect of JH on CDR H3 length distribution is not necessarily similar within VH bias groups, indicating some degree of heterogeneity within bias groups.
Selection of differentially trimmed JH segments associated with different VH germlines in the naïve compartment. The CDR H3 length distribution biases associated with a subset of VH/JH germline combinations may be a consequence of biases in JH trimming as a function of VH germline. JH residue occupancy in the last CDR H3 positions of JH4 and JH5 sequences was used to indirectly determine JH trimming. The JH1, 2, 3 and 6 germlines were not analyzed due to lack of sufficient data or, in the case of JH6, absence of apparent JH segment-associated CDR H3 length biases. No apparent biases in JH residue occupancy relative to overall repertoire were observed for any of the analyzed VH/JH combinations in the nonproductive WA sequences (Fig. S12). However, JH residue trimming biases were observed for different VH/JH combinations in the naïve WA compartment ( Fig. 6 and Fig. S12). General trends in residue occupancy in JH4 were similar in IgM/naïve SRI sequences for the VH/JH4 germline combinations with sufficient numbers for analysis (Fig. S13). Residue-specific trimming biases were found to be mostly coordinated for different JH residues in each analyzed VH/JH combination, as expected due to the directional nature of trimming. However, closely related VH germlines can be associated with distinct trimming biases of different JH4 residues. For instance, VH2-5/JH4 sequences are associated mostly with reduced trimming of IMGT® residues 114 and 115 (Tyr and Phe) whereas in the case of VH2-70/JH4 strongly reduced trimming of residue 116 (Asp) was also observed. The results indicate a complex set of constraints leading to selection of differentially trimmed JH segments in the context of certain VH and JH germlines during naïve repertoire maturation.
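The residue-occupancy readout used above as an indirect measure of JH trimming can be sketched as follows. This is a hedged illustration rather than the authors' script: the function name is hypothetical, the JH4 contribution to CDR H3 is assumed to be the four residues YFDY at IMGT positions 114 to 117, and a germline residue is counted as retained only if it and every residue C-terminal to it match the germline, which models trimming that proceeds from the 5' end of the JH segment. Somatic mutation or N-additions that happen to mimic germline residues are not distinguished.

```python
# Assumed JH4-encoded CDR H3 residues (IMGT positions 114-117).
JH4_TAIL = "YFDY"


def jh_occupancy(cdr_h3_seqs, jh_tail=JH4_TAIL):
    """Fraction of sequences retaining each JH germline residue.

    Returns one value per residue of jh_tail, in germline order.
    """
    n = len(jh_tail)
    counts = [0] * n
    for seq in cdr_h3_seqs:
        # Walk inward from the C-terminal end of CDR H3 and stop at the
        # first position that no longer matches the germline.
        for k in range(1, n + 1):
            if k > len(seq) or seq[-k] != jh_tail[-k]:
                break
            counts[n - k] += 1
    total = len(cdr_h3_seqs)
    return [c / total for c in counts] if total else [0.0] * n
```

Comparing these fractions between sequences of one VH/JH4 combination and all JH4 sequences in the same compartment gives the kind of relative occupancy bias discussed in the text.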

Biases in D segment and N-region lengths within CDR H3 sequences as a function of VH and JH germline use. The observed JH segment length biases in CDR H3 sequences of specific lengths could be an indirect consequence of biases elsewhere in CDR H3, including the length of VH and D sequences and number of N-region and palindromic nucleotide insertions (NP-region) flanking the D region in CDR H3. Naïve sequences from the 3 donors of the WA dataset and IgM/naïve sequences from 3 of the donors with higher number of sequences in the SRI dataset were parsed for VH, JH, D and NP-region lengths within CDR H3. In general, no obvious differences in the prevalence of D germlines with different average lengths were observed in association with different VH and the JH4 and JH5 germlines that would account for JH segment length biases (Fig. S14). One possible exception is VH6-1 which was associated with shorter D germlines. In addition, the number of nucleotides that VH can contribute to CDR H3 did not correlate with JH length biases (Fig. 7 and S15). However, different classes of biases in the lengths of D segments and NP-regions were observed for different VH/JH combinations, even for clones with the same VH germline (Fig. 7 and S15). Biases had similar trends in the WA and SRI datasets, with differences between datasets observed mostly in the magnitude of the biases. No similar biases were observed in the nonproductive WA sequences with the exception of differences in average VH-derived sequence lengths associated with VH germline length and a generally shorter NP-region length in VH2 sequences. However, as the JH segments of nonproductive VH2 sequences do not appear to be biased relative to the overall repertoire, the observed JH4 and JH5 trimming biases associated with VH2 naïve sequences are presumably due to JH trimming rather than NP-region selection. Overall, the results show different classes of biases in D segment, N-region and JH lengths within CDR H3 of naïve sequences that vary among VH/JH germline combinations.

Discussion
Understanding antibody CDR H3 diversity generation, a process critical for antigen binding, has long been a goal in the immunology and antibody engineering fields. Numerous
These datasets were obtained and parsed with different sequencing methods and bioinformatic pipelines, which minimizes the impact of technical artifacts. Some of the biases, such as those associated with VL, JH and D germlines and NP-regions, cannot be easily generated by sequencing or parsing artifacts. However, subtle differences were observed between datasets which may be due to technical reasons, such as the slightly longer average CDR H3 length of sequences in the WA AE dataset compared to other datasets and the significant differences in average CDR H3 length between the naïve compartments of the TX and WA datasets (Fig. 1). However, these differences in baseline average CDR H3 length, which were accounted for in our naïve compartment analyses, did not affect the overall results. The V germline-associated CDR H3 length biases are not related to clonal expansion as sequences were clustered by clonotype. The stringency of clonotype clustering criteria had limited impact on results. This is exemplified by the WA and SRI datasets, which yielded results comparable with other datasets despite having been clustered by clonotype using a higher CDR H3 sequence identity threshold than other datasets. Limited reliability of D germline classification, especially when sequence identity length is short, is well known. However, any D germline assignment errors would not be expected to be associated with particular VH/JH germline combinations. In addition, errors in D parsing should still result in similar length D germline sequence matches, minimizing the impact on D and NP-region length assessments.
Thus, the observed D segment length biases associated with sequences with different VH/JH germline combinations, generally similar in the WA and SRI dataset subsets analyzed, are not expected to be a consequence of systematic junction parsing errors. This is further supported by the lack of similar biases in nonproductive sequences of similar lengths parsed by the same method, except for the expected germline-specific VH region length biases in CDR H3.
One factor that was not addressed here is haplotype variations within and between donors.
Haplotypes could potentially affect CDR H3 length distributions through differences in D germline composition and differential recombination frequencies of D or JH germlines of different lengths in different chromosomes (25). This, combined with differential recombination frequencies of VH alleles (26)(27)(28), may impact CDR H3 distributions associated with certain VH germlines. However, heavy chain variable region haplotype differences would not be expected to impact CDR H3 distributions associated with VL germlines and the AE compartment-specific short CDR H3 length biases. In addition, the observation of essentially the same CDR H3 length distribution biases in several donors from five different sources and junctional segment length biases in six donors from two of these sources, along with a lack of systematic associations between VH and D and JH alleles across donors (27,28), indicates that haplotype variations are unlikely to be a major factor in the CDR H3 and junctional length distribution biases described here.
The analyses shown here use germline information as a proxy for undefined sequence features that ultimately determine the observed biases. Therefore, the selected CDR H3 sequence and structural properties that result in the observed biases and the germline sequence properties that determine those biases remain to be identified. Analysis of VH germline residues that can directly encode or bias CDR H3 IMGT® positions 105 to 107 did not reveal clear correlations between the number or type of encoded residues and most CDR H3 bias groups or junctional segment length biases (Fig. S17). One exception may be the Cut group, which includes two VH2 germlines with slightly longer extensions into CDR H3. The extended VH sequences, along with the observed reduced JH4 trimming associated with these germlines, may contribute to the low frequency of short CDR H3 sequences in the Cut group. In addition, no obvious correlations between JH trimming biases and variations in VH germline residues in positions 40 to 42 generally contacting the differentially trimmed JH residues 115 and 116 were observed. The differentially trimmed residue 116 is located in a region at the base of CDR H3 that can adopt either a "bulged" or "extended" conformation (29,30). The factors that determine the more common bulged
The CDR H3 biases described here pose questions about the functional properties that might shape those biases and the functional consequences of these biases for adaptive immunity.
The emergence of some biases in the naïve repertoire suggests selection against self-reactivity, selection for structural integrity, expression or a combination of these factors as possible mechanisms. One factor that seems not to contribute significantly to most or all of these biases is chain pairing, except perhaps for the association between Vk2-28 and JH6 in the naïve compartment. In agreement with previous reports, no significant association between VH and VL germlines of similar bias types was observed, with one exception being the previously described preferential pairing of the VH3-7/Vk2-30 germlines in the Short VH and VL bias groups in the AE compartment (19,32). If related to selection against self-reactivity, the different biases indicate either that features other than CDR H3 charge and hydrophobicity significantly contribute to self-reactivity or that V segments modulate the self-reactivity mediated by these factors. The bias towards shorter CDR H3 lengths associated with a subset of VH and VL germlines in the AE compartment may be attributable to these same mechanisms or to immune selection. The latter would suggest widespread convergences in human repertoires associated with certain VH and VL germlines or, possibly, some degree of functional specialization in the germline repertoire linked to short CDR H3 sequences, analogous to the association between CDR H3 length and recognition of different antigen classes (33). Our results point to unexpected cross-constraints between VH, VL, JH and other junctional elements selected at different stages of B cell development that significantly shape antibody repertoires.

Materials and methods
Datasets and analysis. Sequences were obtained from the original publications (16,19,20) except for the MA dataset. The sequences in the MA dataset were obtained from a re-sequencing by Illumina MiSeq (34) of the same set of samples previously described by Laserson et al. (21). A summary of the samples used here is given in Table S3. Sequencing methods for the MA dataset are described in the experiment design section associated with sample data (see https://www.ncbi.nlm.nih.gov/sra/SRX2251687). Sequences were used as parsed in the original publications except for the MA dataset, where the raw sequencing files were processed and germlines annotated with a custom pipeline. Briefly, paired-end reads were merged using FLASH (35) to reconstruct the full-length variable domain sequences using the following parameters: read length of 300 bp, expected fragment length of 530 bp, standard deviation of 50 bp. The full-length sequences were subsequently processed to identify the frameworks and CDR regions using position-weighted motifs as previously described (36). IgBlast (37) was used to supplement the region parsed data with germline annotation for each sequence, including nucleotide somatic mutations. Isotypes of the sequences were determined by matching the available CH1 sequence to the closest human CH1 isotype. Each sequence was processed and annotated with the frameworks, CDRs, germline use and clonotype grouping (see below). Nonproductive sequences in the WA dataset used for analyses were limited to frameshifted sequences in the naïve compartment to minimize the indirect effect of clonal expansion. CDR H3 length of nonproductive, frameshifted sequences in amino acids was set as the nearest integer of CDR H3 length in nucleotides divided by 3. For naïve compartment sequences of Donor 1 of the WA dataset only the D1a repeat was used for most analyses (20).
Parsing of D segments was done using Blast (38) after removing the sequences corresponding to VH and JH regions from CDR H3 sequences. An identity of 100% over a span of at least 5 contiguous nucleotides was required for D germline matches. The samples from the D1Nb subset were also included for D segment parsing, removing redundant sequences as described below. All CDR H3 length distributions and germline prevalence analyses were determined using custom scripts and Microsoft Excel 2016. Paired t-tests of CDR H3 length distributions were performed using Microsoft Excel 2016. Mann-Whitney tests for distributions were done using GraphPad Prism version 6. The IMGT® CDR definition and numbering system is used throughout (39).
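Two of the rules stated above lend themselves to a short worked sketch: the amino acid length assigned to frameshifted CDR H3 sequences, and the requirement of 100% identity over at least 5 contiguous nucleotides for a D germline match. The code below is an illustration under assumptions and does not reproduce the Blast-based parsing used in the study: the function names are hypothetical, the exact match is found with a plain longest-common-substring search, and any D germline sequences passed in are supplied by the caller.

```python
def cdr_h3_aa_length(nt_length):
    """Nearest integer of the nucleotide length divided by 3.

    A remainder of one or two nucleotides never produces a .5 tie,
    so round() is unambiguous here.
    """
    return round(nt_length / 3)


def longest_exact_match(query, germline):
    """Length of the longest contiguous stretch shared at 100% identity."""
    best = 0
    prev = [0] * (len(germline) + 1)
    for q in query:
        cur = [0] * (len(germline) + 1)
        for j, g in enumerate(germline, start=1):
            if q == g:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def assign_d(junction_nt, d_germlines, min_len=5):
    """Pick the D germline with the longest exact match of >= min_len nt.

    junction_nt: CDR H3 nucleotides left after removing VH and JH parts.
    d_germlines: {name: nucleotide sequence}, supplied by the caller.
    Returns (name, match_length) or (None, 0) when nothing qualifies.
    """
    best_name, best_len = None, 0
    for name, seq in d_germlines.items():
        n = longest_exact_match(junction_nt, seq)
        if n > best_len:
            best_name, best_len = name, n
    return (best_name, best_len) if best_len >= min_len else (None, 0)
```

In this sketch the D segment length is the length of the exact match, and the nucleotides on either side of it would be counted towards the NP-regions.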
Clonotype clustering. Clonotypes in the CA and TX datasets were defined as sequences from the same donor, VH and VL germlines and CDR H3 length with a nominal 57% or greater CDR H3 amino acid sequence identity, which better approximates an average 60% CDR H3 sequence identity across the range of CDR H3 lengths. For the TX dataset IgG/IgA, CD27pos/IgM and CD27neg sequences were segregated prior to clonotype clustering. The minor fraction of sequences without germline information in the TX dataset was not clustered into clonotypes. Clonotypes in the MA dataset were defined as sequences from the same donor, VH and JH germlines and CDR H3 length with a nominal 57% or greater CDR H3 amino acid sequence identity as above. IgG/IgA and IgM sequences were also segregated prior to clonotype clustering. Clonotypes in the WA dataset were defined as sequences from the same donor with the same VH and JH germline and same CDR H3 length and sequence. If VH germline information was not available then VH subfamily information was used instead, retaining as a representative for the clonotype a sequence with VH germline information if available. If JH germline information was not available then this parameter was ignored, also retaining otherwise identical sequences with available JH information as representatives for clonotypes, if available. Nonproductive sequences in the WA dataset were not processed for clonotype clustering. Clonotypes in the SRI dataset were defined as sequences from the same donor with the same VH and JH germline and same CDR H3 length and sequence.
Only sequences labeled as "productive" in the SRI dataset were analyzed.
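The clonotype definition used for the paired CA and TX datasets can be sketched as below. The text specifies the grouping keys and the 57% identity threshold but not the clustering algorithm, so the greedy single-pass assignment against the first member of each cluster is an assumption, as are the function names and the record field names (`donor`, `vh`, `vl`, `cdr_h3`). Non-empty CDR H3 sequences are assumed. The second function retains one random sequence per clonotype, as described in the Results.

```python
import random
from collections import defaultdict


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_clonotypes(records, threshold=0.57):
    """Group records into clonotypes.

    records: iterable of dicts with keys donor, vh, vl, cdr_h3
    (hypothetical field names). Sequences are compared only within the
    same donor, VH germline, VL germline and CDR H3 length.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["donor"], rec["vh"], rec["vl"], len(rec["cdr_h3"]))
        groups[key].append(rec)

    clonotypes = []
    for members in groups.values():
        clusters = []
        for rec in members:
            for cluster in clusters:
                # Compare against the first member of each cluster.
                if identity(rec["cdr_h3"], cluster[0]["cdr_h3"]) >= threshold:
                    cluster.append(rec)
                    break
            else:
                clusters.append([rec])  # no match: start a new clonotype
        clonotypes.extend(clusters)
    return clonotypes


def one_per_clonotype(clonotypes, seed=0):
    """Keep one randomly chosen sequence per clonotype (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.choice(cluster) for cluster in clonotypes]
```

For the unpaired MA dataset the same sketch would apply with the JH germline in place of the VL germline in the grouping key.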
Repertoire Similarity Index Analysis. RSI was computed in a manner similar to a previously described method (23). For a given set S of CDR H3 sequences, all of the same length L, RSI is measured as follows:

Error bars indicate S.E.M. for data available from more than 2 donors. The full set of distributions is shown in Fig. S6.