The fundamental principles of antibody repertoire architecture revealed by large-scale network analysis

The antibody repertoire is a vast and diverse collection of B-cell receptors and antibodies that confer protection against a plethora of pathogens. The architecture of the antibody repertoire, defined by the network similarity landscape of its sequences, is unknown. Here, we established a novel high-performance computing platform to construct large-scale networks from high-throughput sequencing data (>100’000 unique antibodies), in order to uncover the architecture of antibody repertoires. We identified three fundamental principles of antibody repertoire architecture across B-cell development: reproducibility, robustness and redundancy. Reproducibility of network structure explains clonal expansion and selection. Robustness ensures a functional immune response even under extensive loss of clones (50%). Redundancy in mutational pathways suggests that there is a pre-programmed evolvability in antibody repertoires. Our analysis provides guidelines for a quantitative network analysis of antibody repertoires, which can be applied to other facets of adaptive immunity (e.g., T cell receptors), and may direct the construction of synthetic repertoires for biomedical applications.


INTRODUCTION 28
The high diversity of antibody repertoires enables broad and protective humoral immunity, thus 29 understanding their system sequence-related properties is essential to the development of new 30 therapeutics and vaccines 1,2 . The source of antibody diversity has long been identified to be the 36 developed a large-scale network analysis approach, which was based on representing CDR3 78 a.a. clones as sequence-nodes connected by similarity-edges. Specifically, we developed a 79 computational platform that leverages the power of distributed cluster computing, which is able 80 to compute the extremely large distance matrices required for entire repertoires (≥10 5 CDR3 81 sequences, Figure S1). Analysis of circa 10 5 or more sequences is an intractable problem 82 without parallel computing, thus our implementation utilized the Apache Spark 14 distributed 83 computing framework to partition the work among a cluster of machines ( Figure S1B). The 84 construction of large-scale networks is computationally demanding: a large network comprised 85 of 1.6 million nodes (simulated strings) required 15 minutes if the calculation was performed 86 simultaneously on 625 computational cores ( Figure S1C). Computational costs could have been 87 lowered by performing network analysis on a subsample of the repertoire (e.g., 10 3 ), as was 88 done in previous studies [7][8][9][10][11] . However, extensive analysis into sub-networks has revealed that 89 they are not statistically representative of entire networks, specifically in that sub-network 90 measurements are not always representative of key parameters such as degree distribution, 91 betweenness, assortativity and clustering 15,16 . Thus, it was imperative to construct and analyze 92 large-scale networks based on a similarity distance matrix that covers the full clonal diversity of 93 biological antibody repertoires.

94
Comprehensive biological sampling of antibody repertoires was ensured by the usage of high-95 throughput RNA sequencing data (≈400 million full-length antibody sequence reads) from

108
For each sample (n=57, from 19 mice and three B-cell stages), antibody repertoire architecture 109 was based on the pairwise a.a. sequence similarity of all clones (Levenshtein distance (LD) 110 matrix, hereafter referred to as similarity layer, Figure 1A). When two sequences were similar 111 within a defined threshold, they were connected in the repertoire network (i.e., similarities of 1 112 a.a. differences were captured in similarity layer 1, LD 1 , 2 a.a. in LD 2 and so on).

114
Global patterns of antibody repertoire networks are reproducible

115
In order to quantify the extent to which architectural patterns are reproducible across antibody 116 repertoires, we analyzed the base similarity layer in antibody repertoires (similarity layer LD 1 ).

117
The base layer of the network organization provides information regarding the minimal 118 differences (e.g., 1 a.a.) of all antibody sequences that compose the repertoire. Solely the base 119 4 similarity layer, LD 1 , has previously been analyzed to describe antibody repertoires as networks 120 7-11 . Although repertoires varied highly among mice (74-85% clonal variability, Figure

128
Along B-cell development, PC repertoires were five-fold more disconnected than pBC and nBC 129 networks (PC largest component was nearly 5 times smaller than pBC and nBC, Figure 2A),

130
and their centrality was concentrated on specific clones compared to the homogeneously 131 connected clones in pBC and nBC networks (centralization 01 = 0.05, density 01 = 0.01,

137
"#$ = 0.48, ,#$ = 0.41), whereas PC networks were consistently disassortative: their highly 138 connected clones were linked to clones with few connections ( @$ = −0.09, Figure 2B). The 139 characterization of the global patterns of antibody repertoire networks indicated that pBC, nBC 140 and PC repertoires were reproducible. pBC and nBC clones cover a larger diversity space than 141 clones in PC repertoires, where sequence similarity showed to be centralized and targeted 142 towards certain clones.

144
Clonal features of antibody repertoire networks are reproducible

145
Antibody repertoire architecture was also reproducible at the level of clonal (local) features in 146 pBC and nBC networks, which were characterized by a low variability (coefficient of variation, 147 CV) across various clonal parameters. The low variability of clonal parameters in pBC and nBC 148 networks ( "#$ = 2 − 28%, ,#$ = 1 − 24%) was in contrast to the higher variability observed 149 in PC repertoires ( @$ = 13 − 118%, Figure 2C)

160
Closeness analysis revealed that an analogous number of similarity edges were required to 161 access every other CDR3 from a given CDR3 clone in antibody repertoire networks of different 162 individuals, as the similarity of a clone to every other CDR3 clone in the repertoire varied by 163 "#$,,#$ = 17%. Betweenness, the "bridge" function of a clone in sequence similarity, varied 164 slightly across individuals with "#$,,#$ = 28%, suggesting a comparable structure of the 165 similarity route function of CDR3 sequences in these repertoires. These characteristics reflect 166 the transversal diversity of pBC and nBC antibody repertoires where the clones cover a larger 167 space and their similarity is more homogenously distributed at the global repertoire level.

175
CDR3 authority correlated positively with germline V-gene frequency in PC clones ( @HIJKL, = 176 0.39), denoting the potential role of the V-gene usage in the centralization of these networks 177 ( Figure S2G). Thus, certain high frequency V-genes predispose clones to be highly connected

197
In order to prove the tree-/star-like hypothesis and further investigate the sequence similarity 198 space, we performed k-core 24 decomposition and neighborhood analysis ( Figure 2F, G, H). The (0.04% and 0.06% of CDR3 clones in k-core, respectively) were 200-fold smaller than those of 202 PC (8.2%, Figure 2F). Antigen-inexperienced repertoires were thus characterized by larger 203 coreness values (>20), signifying a more layered structure of CDR3 similarity ( Figure S2J, K) 204 and confirming their tree-like structure. Furthermore, the high convergence of V-genes at the 205 core-level of antibody repertoire networks (pBC=50%, nBC=70%, PC=1-10%, Figure 2G), in 206 contrast with the low exact CDR3 sequence core-overlap ( Figure S3A, B, C), suggested a 207 genetically determined origin of the structure.

208
The average CDR3 neighborhood size, which designated the set of similar CDR3 clones along 209 each sequential step of similarity from a certain clone (orders n=1-50), was order-independent 210 in PC and plateaued at 2% of the network, confirming that PC clones were connected to one  Redundancy is a hallmark of robust systems; for example, redundancy in genes with the same 245 function is the main mechanism of robustness against mutations in genetic networks 30 . To 246 investigate the extent of redundancy within antibody networks, we examined whether their 247 architecture at the base similarity layer (LD 1 ) was manifested in higher order similarity layers 248 (LD >1 ). Differences greater than one a.a. between antibody sequences could represent the 249 potential personal scenarios of antibody repertoire evolution (somatic hypermutation 12,13 ), a 250 result of successful survival through selective processes. Specifically, if a clone connected to 251 many other clones in the LD 1 similarity layer mutates into a similar clone at a specific a.a.

252
position, this potential clone will be connected to many clones in the LD 2 similarity layer. Thus,

253
higher order similarity layers can serve as surrogates for the evolution of potential antibody 254 repertoires from antigen-inexperienced B-cell populations.

255
To quantify the extent of redundancy across similarity layers, we calculated the prediction 256 accuracy of LD 1 versus similarity layers LD 2-12 using a leave-one-out cross-validation approach 257 ( Figure 3D, Figure S3H and S3I). Specifically, quantitative redundancy was low in PC

289
Interestingly, the structure of antibody repertoires was fragile to the removal of public clones.

290
The crucial role that public clones play as pillars of antibody repertoire architecture was 291 revealed by large-scale networks, yet future research will need to determine the functional role 292 (antigen specificity) of public clones in a humoral response.

293
We found that antibody repertoires presented intrinsic redundancy across similarity layers. This 294 means that not only minimal differences (1 a.a. of the base layer LD 1 ) but also further 295 diversification (> 1 a.a. differences between antibody sequences) may be hardcoded into the

306
The three fundamental principles of the architecture of antibody repertoires uncovered here 307 through network analysis may serve as a blueprint for the construction of synthetic antibody 308 repertoires, which may be used to simulate natural humoral immunity for monoclonal antibody

512
(C) CDR3 clones were removed randomly at 10%, 50% and 90% from each original repertoire 513 (20 times) and the power-law distribution was fit to the cumulative degree distributions of the 514 remaining CDR3 clones. A p-value=0.1 is indicated as a red dashed line. In PC samples a fit 515 was not feasible after removal of 90% of CDR3 clones (NA).

570
(E) p-values (mean±s.e.m) of the power-law fit for each cohort.

571
(F) One-sided and two-sided p-values (mean±s.e.m) for the discrimination between the 572 exponential (one-sided p-value=1, two-sided p-value=0) and the power-law fits.

611
Data preprocessing and CDR3 clonal analysis 612 Antibody sequences have been preprocessed and VDJ annotated with MiXCR 42 and further 613 filtered to retain only those sequences that had CDR3 length ≥ 4 a.a. and occurred more than 614 once in each CDR3 repertoire data set ( Figure S1A). Clones were defined by 100% a.a.

619
To construct networks (graphs), a sparse triangle matrix of pairwise Levenshtein distances (LD) 620 between CDR3s must first be computed. For small samples (up to 100,000 unique CDR3 621 sequences) such a calculation is relatively quick on a single computer. However, due to the N 2 622 complexity of required calculations, computing the pairwise matrix for samples of >100,000 623 unique CDR3 sequences becomes prohibitively expensive. To perform these computations, we 624 developed software that utilizes the Apache Spark (2) distributed computing framework to 625 partition the work among a cluster of many machines ( Figure S1B). We chose specifically 626 Apache Spark because i) its deployment is very flexible with regard to underlying computing 627 infrastructure and ii) for similarity layers LD >1 , the networks become extremely large and difficult 628 to process. In these cases, our package can take advantage of the Spark GraphFrames 629 distributed graph library 44 , which allows scaling to even larger samples with millions of 630 sequences ( Figure S1C). With this approach we were able to compute the distance matrices for 631 large samples (>100,000 unique CDR3 sequences) within minutes ( Figure S1, B and C).

632
In addition to the computational complexity inherent in creating the distance matrix, the 633 construction of networks for large LD is very computationally and time-wise costly. We therefore 634 avoided constructing networks altogether for calculating the node degrees and instead used a

682
In the context of antibody repertoires, we let N = |V| and L = |E|. The order of a graph N 683 represents the number of its unique CDR3 clones (nodes). The size of a graph L is the number 684 of its CDR3 similarity connections (edges). The degree k, that represents the edges connected 685 to a node, describes the count of all similar CDR3 clones to a CDR3 based on LD. Because the 686 degree indicates how active a node is, it could be interpreted as a measure of how central a CDR3 clone is in the antibody repertoire network. In simpler terms, it quantifies the number of 688 CDR3 clones that are similar to a certain CDR3, and thus the potential development or the 689 evolutionary routes to this CDR3.

690
The is the average number of similar CDR3 clones. The 691 degree distribution P k = N X /N, defined as the fraction of nodes with degree (N X ) in total 692 nodes, represents the fraction of CDR3 clones that have the same number of similar CDR3s.

693
The cumulative degree distribution P X = p Xe

728
Closeness (centrality 21 ) (c) was calculated to measure how many steps were required to 729 access every other CDR3 from a given CDR3 clone in antibody repertoire networks. We

756
Graphic representations were produced using base R 56 and the ggplot2 R package 57 .

757
Heatmaps were produced using the NMF package 58 . Networks and network clusters