IGoR: a tool for high-throughput immune repertoire analysis

High throughput immune repertoire sequencing is promising to lead to new statistical diagnostic tools for medicine and biology. Successful implementations of these methods require a correct characterization, analysis and interpretation of these datasets. We present IGoR -- a new comprehensive tool that takes B or T-cell receptors sequence reads and quantitatively characterizes the statistics of receptor generation from both cDNA and gDNA. It probabilistically annotates sequences and its modular structure can investigate models of increasing biological complexity for different organisms. For B-cells IGoR returns the hypermutation statistics, which we use to reveal co-localization of hypermutations along the sequence. We demonstrate that IGoR outperforms existing tools in accuracy and estimate the sample sizes needed for reliable repertoire characterization.

The adaptive immune system recognizes pathogens by binding their antigens to specific surface receptors expressed on T and B cells. The recent advent of high throughput immune repertoire sequencing (RepSeq) [1][2][3][4] gives us direct insight into the diversity of B-cell and T-cell receptor (BCR and TCR) repertoires with great potential to change the way we diagnose, treat and prevent immune system related disorders. A growing number of algorithms and software tools have been designed to address the new challenges of RepSeq, in particular sequence analysis, germline assignment and clone construction [5][6][7][8][9][10]. However, each receptor sequence can be generated in a large number of ways, or "scenarios," through recombination of genomic segments, insertions and deletions and hypermutations. Standard assignments introduce systematic errors when describing this inherently stochastic process. Quantitatively characterizing the diversity and the biases of these mechanisms remains a challenge for understanding adaptive immunity and applying RepSeq for diagnostics.
We present a flexible computational method and software tool, IGoR (Inference and Generation of Repertoires), that processes raw immune sequence reads from any source (cDNA or gDNA) and learns unbiased statistics of V(D)J recombination and somatic hypermutations. Using these statistics, for each sequence IGoR outputs a whole list of potential recombination and hypermutation scenarios, with their corresponding likelihoods. IGoR's performance at identifying the correct scenario is 2.5 times better than current state-of-the-art methods. IGoR used as a sequence generator produces an arbitrary number of randomly rearranged sequences with the same statistics as in the dataset. Applied to BCRs, IGoR learns a context-dependent hypermutation model to identify hotspots, which allows for a comprehensive analysis of the mutational landscape of BCRs.

RESULTS
Probabilistic assignment of recombination scenarios V(D)J recombination selects two or three segments (V and J for TCR α and BCR lights chains; V, D, and J for TCR β and BCR heavy chains) from a library of germline genes, and assembles them while deleting base pairs and inserting other non-templated ones at the junctions (Fig. 1a). B cell receptors can further diversify through somatic hypermutations during affinity maturation. The recombination process is degenerate, as the same sequence can be generated in many different ways [11]. IGoR starts by listing the possible recombination and hypermutation scenarios leading to an observed sequence in the dataset. It then assigns probability weights reflecting the likelihood of these scenarios. As the example in Fig. 1a shows, explored scenarios can be very different yet have comparable contributions to the sequence likelihood. Since exploring all possible scenarios would be computationally too costly, IGoR restricts its exploration to the reasonably likely ones. Scenario exploration takes from 1 ms up to less than a second per sequence on a single CPU core, depending on the chain (see full distributions of runtimes in Fig. S1). Different recombination architectures and dependencies can be configured within IGoR by specifying dependencies between elementary events (gene choices, deletions, insertions, hypermutations) through an acyclic directed graph, or Bayesian network, as illustrated in Fig. 1b for the case of TCR β chains (see Online Methods for the other used structures).
IGoR functions according to three modes: learning, analysis, and generation (Fig. 1c)

FIG. 1:
IGoR's pipeline for sequence analysis. (a) V(D)J recombination proceeds by joining randomly selected segments (V, D, and J segments in the case of IGH). Each segments gets trimmed at its ends (hashed areas), and a varying number of nontemplated insertions are added between them (orange). Hypermutations (in the case of B cells) or sequencing errors (in red) further enhance diversity. IGoR lists putative recombination scenarios consistent with the observed sequence, and weighs them according to their likelihood. (b) The likelihood of each scenario is computed using a Bayesian network of dependencies between the recombination features (V, D, J segment choices, insertions and deletions), as illustrated here for the human TRB locus. Architectures for TRA and IGH are described in Online Methods. (c) IGoR's pipeline includes three modes. In the learning mode, IGoR learns recombination statistics from data sequences. In the analysis mode, IGoR outputs detailed recombination scenario statistics for each sequence. In the generation mode, IGoR produces synthetic sequences with specified recombination statistics.
of sequences using a Sparse Expectation-Maximization algorithm (see Online Methods). In the analysis mode, IGoR assigns recombination events to sequences in a probabilistic way, by outputing the most likely scenarios ranked by their probabilities, as well as the overall generation probability of the sequence. In the generation mode, IGoR outputs random sequences with specified statistics, e.g. learned from real datasets.

Inference of V(D)J recombination
We used IGoR's learning mode to infer the accurate statistics of V(D)J recombination from four datasets comprised of unique sequences of non-productive rearrangements of three different chains, sequenced either at the levels of mRNA (TCRα chain or TRA, and TCRβ chain or TRB [14]) or DNA (TRB [13], BCR heavy chain or IGH from naive cells [12]), generalizing earlier methods [15][16][17]. Restricting to nonproductive unique sequences allowed us to avoid biases introduced by functional selection. The Expectation-Maximization algorithm converged within a few iterations (see Fig. S2 for convergence of parameters, and Fig. S3 for the case of IGH).
The same TRB insertion and deletion distributions were inferred regardless of the individual, laboratory of origin, or sequencing protocol, and of whether DNA [13] (light blue distributions in Fig. 2) or mRNA [14] (dark blue) was used. By contrast, V and J gene usage varied moderately but significantly across individuals, and even more across sequencing technologies, suggesting possible primer-dependent biases (Fig. S4, see also Fig. S17 for IGH D-J gene usage). Insertions at the TRA V-J junction, and at the TRB V-D and D-J junctions have similar distributions (Fig.2a), as previously reported [17]. IGH have significantly more insertions at the junctions than TCRs, consistent with previous observations [16].
We then validated the learning algorithm on synthetic datasets. Sequences were generated in batches of 10 3 to 10 5 by IGoR with a variable error rate, using statistics inferred from 60bp DNA TRB data. IGoR's learning algorithm was then run on these raw sequences, and the resulting statistics compared to the known ground truth. We found that the inference was highly accurate for datasets of 10 5 sequences and an error rate set to its typical experimental value, 10 −3 ( Fig. 3a and b), and was not affected by overfitting. However, not all high-throughput sequencing datasets reach this depth, especially when restricted to unique non-productive sequences. In addition, hypermutation rates in BCRs, which IGoR treats in the same way as errors, can reach 1-10%. To assess how these limitations affect accuracy, we calculated the Kullback-Leibler divergence (a non-parametric measure of difference between probability distributions, see Online Methods) between the true distributions and the inferred ones, for varying sizes of datasets and error rates. For an er- ror rate of 10 −3 , ∼ 5000 unique out-of-frame sequences (which can be obtained from less than 2ml of blood with current mRNA sequencing technologies [14]) were sufficient to learn an accurate model of TRB (Fig. 3c), with the majority of the estimation error due to deletion profiles (which account for the majority of parameters). Increasing the error rate has little effect up to rates of 10 −2 , but significantly degrades accuracy for typical hypermutation rates, 10 −1 (Fig. 3d), with the gene usage distribution affected the most (Fig. S5). This suggests that the recombination statistics of BCRs should be inferred using sequences from naive, non hypermutated cells (as we did in Fig. 2).

Analysis of scenario degeneracy
By considering all possible recombination scenarios for each sequence, our approach departs significantly from most existing methods, whose goal is to find the most likely one. To assess how often the most plausible scenario is the correct one, we analyzed synthetic sequences for which the generation scenario is known. For each generated sequence, we used IGoR's analysis mode to enumerate the set of scenarios that were consistent with the nucleotide sequence, and ranked them according to their likelihood. Fig. 4a shows the distribution of the rank of the true recombination scenario for TRB and IGH synthetic data. The maximum-likelihood scenario is not the correct one in 72% of IGH sequences and 85% of 60bp TRB sequences. The distributions have long tails, meaning that a substantial fraction of sequences have a very large recombination degeneracy.
We then estimated how many scenarios, ranked from most likely to least likely, were needed to explain a given fraction f of the total sequence likelihood. The distributions of this number across 100,000 generated sequences are shown in Fig. 4b for various values of f (see Fig. S6 for the equivalent plot for TRB data). To enumerate the correct scenario with f = 95% confidence requires to include at least 30 to 50 scenarios. This analysis indicates that many scenarios need to be considered to correctly characterize the generation process.
IGoR outputs the probability of generation of the processed sequences, by summing the probabilities of all their possible scenarios, which deterministic assignment methods cannot do. It was shown that this generation probability was predictive of sharing properties between healthy individuals [14,15]. This functionality could be used as a useful indicator of convergent recombination in studies attempting to identify antigen-specific or autoimmune related sequences from large clinical datasets.

Comparison to other methods
We compared our method to two representative stateof-the-art algorithms: MiXCR [8], an efficient assignment tool that finds the best matching germline genes, and Partis [10], a BCR-specific tool that uses maximum likelihood to find the most plausible scenario. 130 base-pair IGH sequences were synthetized in silico from a datainferred model using IGoR's generation mode. We then assigned recombination scenarios using MiXCR, Partis  Validation on synthetic data. Short synthetic reads of recombined TRB or IGH sequences were generated with known recombination statistics, and given to IGoR as input to re-infer these statistics. Inference with 10 5 sequences and a typical sequencing error rate of 10 −3 gives excellent agreement for (a) gene usage and insertion statistics and (b) deletion statistics (Pearson's r for deletions is calculated on the joint statistics of gene usage and deletion number; cross size scales with gene usage). (c) Discrepancy between true and inferred values of the recombination statistics, measured by the Kullback-Leibler divergence, as a function of the number of unique sequences in the sample, and decomposed according to the features of the recombination scenario. (d) Same as (c), for increasing rates of sequencing errors or of hypermutations.
and IGoR, and compared them to the true scenarios with which sequences were generated. In IGoR's and Partis' case, the model parameters were learned from the generated dataset to mimick the analysis of real data. Fig. 4c shows the performance of the three methods in assigning the correct scenario of recombination. IGoR performs about 2.5 times better than MiXCR and Partis in pre-dicting the complete recombination scenario, as well as each of its individual components. Note that Partis does not include palindromic insertions, which both IGoR and MiXCR treat by appending a short palindromic sequence at the end of each germline segment; restricting the analysis to sequences generated without palindromic insertions makes Partis' performance comparable to that of MiXCR (Fig. S7). Next, we compared the recombination statistics learned by the three methods to the true statistics used to generate the data. For MiXCR and Partis, we built the distribution of recombination events assigned to each sequence, while for IGoR these distributions were inferred using Expectation-Maximization, as explained before. All three methods yield similar statistics for V and J gene usage and deletion profiles (see Fig. S8). However, the dependency between D an J usage in TRB is correctly captured by IGoR but not by the other methods (Fig. 4d). TRB D and J genes are organised in two clusters, one containing D1 followed by genes of the J1 family, the other containing D2 followed by genes of the J2 family. Because of this organisation, D2 cannot be recombined with genes from the J1 family [18]. MiXCR assigns 20% of impossible D2-J1 recombination events to sequences (note that Partis does not process TCRs). By constrast, IGoR correcly learns the rule by assigning zero frequency to these impossible D-J pairs. The same results are obtained directly on real data (see Fig. S9). Finally, IGoR accurately reconstructs the distribution of insertions, while the other methods systematically overestimate the probability of zero insertions ( Fig. S8a and  b).

Somatic hypermutations
To study patterns of SHMs in BCR expressed by memory B cells, we included into IGoR the possibility to infer a sequence-dependent hypermutation rate. The probability of error or mutation at a given position on the nucleotide sequence is assumed to depend on its immediate n-mer context (see Fig. 5a), through the logistic transformation of an additive score computed using a position weigth matrix (PWM), similar to binding energy motifs used to describe DNA binding sites [19]. We ran IGoR on memory out-of-frame IGH sequences from Ref.
[12] to learn 7-mer PWMs, as well as overall mutation rates (the geometric mean of the mutation rate over all possible 7mers), while fixing the recombination statistics to those previously learned from naive sequences, using Expectation Maximization (see Online Methods). IGoR's probabilistic framework handles the degeneracy of sequence origin caused by convergent combinations of gene choices and hypermutations. The learning procedure differs crucially from Ref. [16], where the hypermutation rate was uniform. Three distinct PWMs were learned for V, D, and J templated regions (Fig. 5b). To validate our PWM and mutation rate learning algorithm, we generated syn-  thetic data with hypermutations according to the model learned from the real dataset, and re-learned its parameters using IGoR, finding excellent agreement (Fig. S10).
The PWM prediction for the position-dependent probability of hypermutations correlated well with that actually observed in the sequences (r = 0.7 for V genes, see Fig. 5c and Fig. S11). PWMs were very reproducible across the two tested individuals (r = 0.98, Fig. S12), indicating that the inference procedure is robust to the individual history of infections, and pointing to the universal nature of the SHM mechanism. By constrast, the inferred overall mutation rate differred by a two-fold factor between the two individuals, probably owing to differences in age, past infections, or lifestyle (Fig. S12). The motifs we found recapitulate previously reported hotspot motifs (positive values of the PWM) for every gene, including WRCY (or WRCH [20]) and WA [21, 22] (W = A or T, Y = C or T, R = G or A; mutated position underlined), as well as cold-spot motifs albeit to a lesser extend (SYC, where S = C, G) [23]. In all three motifs, C and G are generally underrepresented, except for the mutated position in V and D genes where T is less mutated than others. We assessed the robustness of the model to n-mer length by learning PWMs of sizes ranging from 3 to 9 (Fig. S13). The contributions of each relative position did not change substantially as a function of context length. Positions at least up to 4 nucleotides away from the mutation locus contribute to the motif. This could mean that the context dependence is broad, or alternatively that the motif model is indirectly capturing non-contextual effects. Overall, the inferred PWMs give both a more detailed and more nuanced view of the rules that govern hotspot positions, and cannot be reduced to a few easily describable motifs. Fig. 5b shows that the motifs differ substantially between V, D, and J genes. V-learned PWMs only moderately predict J-gene hypermutation rates (r = 0.5 versus r = 0.7 for V-gene rates), and J-learned PWMs predict V-gene rates even worse (r = 0.24, see Fig. S11). This disagreement indicates that predictions purely based on context-dependent motifs are insufficient to explain all of the variability in hypermutation probabilities, and that other mechanisms must be at play. The overall mutation rate was also different between germline genes, consistent with reports that the chromatin state affects hypermutation rates [24][25][26]. We then used the inferred PWM within IGoR to probabilistically call putative hypermutations in sequences. We first examined the distribution of the number of mutations in a sequence (Fig. 5d). The empirical distribution (red) is more skewed and has a longer tail than would be expected by assuming independent hypermutations in each sequence, as predicted by generating randomly hypermutated sequences with the inferred PWM (blue). This observation is consistent with the fact that different B cells have undergone a variable number of cycles of affinity maturation, resulting in differences in effective hypermutation rates. Second, we asked whether hypermutations co-localized within the same sequence, by calculating the enrichment of hypermutations at two positions as a function of their genomic distance (Fig. 5e). While this enrichment is 1 in synthetic sequences (since our model assumes that hypermutations are independent of each other), real data shows up to a 4-fold enrichment of hypermutations at nearby positions. This difference is consistent with the fact that AID can cause repairs of DNA over large regions [27]. The typical distance at which the co-localization enrichment index decays gives an estimate for the length of these correlated regions of hypermutations, about 15 base pairs. IGoR can in principle calculate the generation probability of any sequence. However, highly hypermutated sequences pose an additional challenge because the ancestral (unmutated) recombined sequence itself is sometimes not known with certainty. To overcome this issue, IGoR explores for each sequence all possible recombination and hypermutation scenarios, and calculates the generation probability of each potential ancestral sequence. Using synthetic data, we checked that the generation probability of individual sequences is well predicted by this method (r = 0.97, see Fig. S14 and Online Methods), and its distribution accurately reproduced (see Fig. S15).

DISCUSSION
By treating alignments of immune receptors to the germline probabilistically [15], IGoR corrects for systematic biases in the estimate of V(D)J recombination statistics, and predicts recombination scenarios more accurately than previous methods. Its detailed analysis of recombination scenarios further reveals that, even with a perfect estimator, the scenario is incorrectly called in more than 70% of sequences, suggesting caution when interpreting results from deterministic assignments.
Although we demonstrated its functions on human TRB and IGH, IGoR's flexible structure makes it applicable to any variable lymphocyte receptor (TCR or immunoglobulin) and species for which genomic data is available. Unlike Hidden Markov Model based methods (e.g. [10,17]), it can include a wide array of possible dependencies between the recombination events. It can also be adapted to handle unusual or incomplete rearrangements (D-J rearrangments, DD2/DD3 rearrangements in TCR δ chains, hybrid TRA/TRD recombinations, etc.). IGoR can also help detect unusual rearrangement features by using its syntheticaly generated sequences as a control. For instance, rearrangements with tandem Ds have been reported [12], but distinguishing them from random insertions can be challenging. To test this, we counted sequences with two ≥10-nt D segments in the data, and compared it with predictions from IGoR's synthetic sequences generated with a single D segment (see Online Methods). We found 5 times more double-D assignments in IGH data than in the control, validating the findings of [12]. In contrast, the same analysis performed on TRB showed no significant presence of tandem Ds. Future versions of IGoR should include the possibility of including multiple D rearrangements. Note that IGoR does not find reversed Ds in IGH (Fig. S18).
IGoR infers recombination statistics from nonproductive sequences only, but can do it with as few as 5000 sequences. Once a recombination model is learned for a given locus, IGoR can generate arbitrary numbers of synthetic sequences with the same statistics, which could be used as a control in disease-association studies, by helping to distinguish antigen-specific clonotypes from public sequences with high convergent recombination frequencies, and thus dispense with the need of a healthy control cohort.
Our analysis of hypermutations led us to infer distinct sequence motifs for mutation targets on the V, D and J segments of human IGH, in contrast with previous approaches that assume a universal context model [28]. Although our motifs were learned on short reads comprising only part of the V and J segments, analysis of synthetic sequences showed that motifs could be accurately learned from such short reads. Exploring their applicability to longer reads would be an interesting future direction. We further found that hypermutations tend to co-localize along the sequence. These results suggest that at least three effects determine hypermutation hotspots: the immediate DNA context of the hypermutation, as modeled by our sequence motifs, position-specific effects mediated by e.g. chromatin configuration and histone modifications, and the co-occurence of nearby mutations. Future improvements of hypermutation target predictions will have to rely on a better understanding of the precise mechanisms of AID operation [26].
Software availability. IGoR along with example datasets and pre-learned human TRA, TRB, and IGH models is available at bitbucket.org/qmarcou/igor. Acknowledgements. The work was supported by grant ERCStG n. 306312.

ONLINE METHODS
IGoR functions according to three modes: VDJ statistics learning, sequence analysis, and sequence generation. All modes rely on an explicit stochastic description of the recombination and hypermutation events. In the analysis and learning modes, each sequence is analysed by listing all possible recombination and hypermutation scenarios. The learning mode iterates the analysis mode by updating the model parameters according to an Expectation-Maximization algorithm.

Recombination model
In all three modes, IGoR assumes that receptor sequences result from a recombination scenario comprising several stochastic elements -choice of germline segments, deletions and insertions. These features are stochastic and share statistical dependencies with each other. For tractability, we assume that these dependencies can be represented by an acyclic graph, also called Bayesian network (see SI Text for details). This structure can be configured within IGoR's setup files. For the purpose of this study, we used the following dependency structures for the α chain of T cells (TRA): and for the β chain of T cell receptors (TRB) and heavy chain of B cell receptors (IGH): where V, D, J denote the choice of germline genes, delV , delJ the number of deleted base pairs at the ends of the V and J segments, delDl, delDr the number of deletions at the left and right ends of the D segments, insVJ, insVD, insDJ, the numbers of insertions at each of the insertion sites (between V-J, or V-D and D-J), and n i , m i the identities of the inserted base pairs. In the case of TRB, gene usage is further factorized as P (V, D, J) = P (V )P (D, J)

Context dependent hypermutation model
When processing TCRs or naive BCRs, a constant error probability is assumed throughout the sequence. When processing memory BCRs, a context-dependent hypermutation model is assumed: at each position along the V, D, and J genes, a hypermutation occurs with probability P mut , with where (π −m , . . . , π m ) is the (2m + 1)-mer sequence context centered around the location of the mutation. The entries of the position weight matrix (PWM), e i (π), contribute additively to the motif, and µ is the overall hypermutation rate.

Alignment to germline and scenario listing
In the analysis and learning modes, each sequence is first aligned to all possible germline genes retrieved from genomic databases (e.g. IMGT), using the Smith-Waterman algorithm [29]. Only germline genes with alignment scores higher than an adjustable threshold are considered for further analysis (see SI Text for details). Possible scenarios are then listed by picking germline genes with an above-threshold alignment score, and by choosing a number of base pairs to further delete from the ends of their aligned parts. The base pairs located between the germline segments trimmed in this manner are called insertions, and alignment mismatches to the germline are called errors or hypermutations. When the palindromic end of germline genes is not entirely deleted, the number of remaining palindromic base pairs are described as negative deletions. To allow for the possibility that the D segments be inserted in both directions in BCRs, we added the reverse complements of each D germline segment to the list of genomic templates.

Sequence analysis
For each sequence in the dataset, the probability of possible scenarios are computed using the recombination probability of Eqs. 1 or 2, multiplied by the probability of errors or hypermutations P err : P scenario = P recomb × P err . Scenarios are then listed in order of decreasing probability. The sum of probabilities P recomb × P err of possible recombination and hypermutation events gives the probability of observation of that particular sequence read, P read . The probability that the pre-mutation sequence was generated by recombination, P gen , is defined as the sum of the probabilities P recomb of scenarios leading to that sequence. Since the pre-mutation sequence is not known with certainty, we calculated an approximate generation probability P gen as the geometric mean of P gen of all possible unmutated sequences consistent with the read, weighted by their posterior probabilities, P gen × P err /P read . Alternatively, we approximated P gen by that of the most likely pre-mutation sequence.
To shorten computation times, only plausible scenarios are listed by IGoR. Scenarios are enumerated by exploring the nodes of a hierarchical decision tree, where each depth corresponds to the choice of a scenario feature. Branches of the tree are discarded if their total contribution to the sequence probability is upper-bounded to be below a certain threshold. Details of the procedure are given in the SI text.

Learning algorithm
The learning algorithm infers the parameters of Eqs. 1 or 2, as well as the error or hypermutation model parameters of Eq. 3, from a large datasets of unique sequences. It relies on the sequence analysis module, and follows an Expectation-Maximization procedure. Starting from an arbitrary (but reasonable) set of parameters, all sequences in the dataset are analysed as described above, producing a long list of scenarios associated with each sequence. We define the pseudo-loglikelihood as the weighted sum of the log-likelihoods of all scenarios of all sequences, where the weights are given by the conditional probabilities of scenarios given the sequence, P recomb /P read (Expectation step). This pseudolog-likelihood is then maximized with respect to the parameters of the log-likelihoods (Eqs. 1-3), while keeping the weights fixed. The parameters are updated, and the procedure repeated, until convergence. Mathematical derivations of the update rules and details about Expectation-Maximization are given in the SI Text.

Validation of model inference
To compare the model parameters θ 1 inferred from synthetic data to the known model parameters θ 2 from which these data were generated, we computed the Kullback-Leibler divergence between two probability distributions, D(θ 1 θ 2 ) = E P (E, θ 1 ) log[P (E, θ 1 )/P (E, θ 2 )], where the sum is over all scenarios E. P (E, θ) is computed using Eqs. 1 or 2. This Kullback-Leibler divergence can be decomposed into additive contributions from each of the scenario features, as detailed in the SI text.

Correlations between hypermutations
To evaluate correlations between the occurence of hypermutations at close-by positions along the BCR sequence, we computed the radial disbribution function defined as: where f (i, V ) and f (i, j, V ) are the frequencies of hypermutations at position i, and at both positions i and j, respectively, calculated from individual scenario statistics weighted by their posterior probabilities. C V (r) is the set of pairs of positions separated by r that were observed a large enough number of times in gene V , and N r = V |C V (r)|.

Usage of tandem D segments
In order to assess the occurrence of double D insertions during the VDJ recombination event of IGH or TRB, we computed the frequency with which one could align (with the Smith-Waterman algorithm) two non-overlapping Ds over least 10 nucleotides, between the best V and best J alignments. We then compared the frequency obtained for synthetically generated sequences, to that obtained for real sequencing data.

APPENDICES A Model definitions
We start by giving the particular model structures used in this study. We then give a more general definition applicable to other general types of recombination products.

Models for TRA, TRB and IGH
We define a probabilistic model for each type of chain (e.g. α, β, heavy, light) that describes the probability of each recombination event E by the probabilities of the known elements of the recombination subprocess (gene choice, insertions, deletions at each of the junctions etc) for each chain, and assumes only the minimum correlations between the subprocesses needed to explain the correlations observed in the data. We model insertions as a Markov chain (the identity of an inserted nucleotide only depends on the previously inserted one) with a nonparametric length distribution [14][15][16]. For each insertion site (X= VD and DJ for β and heavy chains and X=VJ for α and light chains) we infer the probability of observing a non-templated sequence of a given length, P (insX), and the transition matrices P VJ (n i |n i−1 ), P VD (n i |n i−1 ), P DJ (m i |m i−1 ) giving the probability of inserting a given nucleotide as a function of the identity of previous one. For each gene we infer the probability of the number of deletions conditioned on the gene identity, e.g. P (delV |V ) for deletions from the V gene. We model templated palindromic insertions as negative deletions [15,16]. The D gene is very short and may get fully deleted. This introduces correlations between the deletions on both sides of the original D gene template. We account for these correlations by inferring the joint probability P (delDl, delDr|D). We treat every allele as a different gene [16] and infer the joint gene usage P (V, D, J) for β and heavy chains, and P (V, J) for α and light chains, to be able to capture correlations between segment usage.
For TCR α chains or BCR light chains, the probability of a recombination event E = (V, J, delV, delJ, insVJ) is: Similarly, the probability P β/h recomb (E) of a recombination event E = (V, D, J, delV, delDl, delDr, delJ, insVD, insDJ) for a TCRβ or BCR heavy chain is: In the case of TRB, gene usage is further factorized as P (V, D, J) = P (V )P (D, J).

General model formulation
IGoR is designed in a modular way so the user can define arbitrary model forms. The models are Bayesian networks encoded as directed acyclic graphs, whose vertices i = 1, . . . , K label individual recombination subprocesses E i (V, D, J choices, deletions, etc. in the examples above). Dependence of the recombination process j upon i is encoded by a directed edge between i and j, denoted v ij = 1 (while v ij = 0 means no direct dependence). The set of parents of i, i.e. processes on which i depends directly, is denoted by P i = {j|v ji = 1}.
Using these definitions we can, generally and irrespectively of the assumed form of the underlying model of recombination, write the probability of a complete recombination scenario E = (E 1 , . . . , E K ) as: where θ denotes the underlying model parameters (i.e. probability distributions of gene choice, insertions at a given junction, and deletions from a given gene in the studied examples). Each recombination scenario E leads to a unique sequenceŜ(E) = (Ŝ 1 , . . . ,Ŝ L ),Ŝ i (E) ∈ {A, C, G, T } (in the following we often write S forŜ(E) for brevity). However, in order to produce a given sequence S several scenarios might be equivalent, and we can write the probability of generating a given sequence as: The above description only holds to assess the generation probability of a pure product of recombination and does not account for sequencing errors or hypermutations. Note that, since longer reads allow for more reliable determination of V and J gene segments, P gen depends in general on read length: shorter reads can be created in more ways than longer reads, leading to larger P gen .

Errors and hypermutations
Sequencing is inherently noisy and introduces nucleotide substitutions. In addition, BCRs can accumulate hypermutations, which can be mathematically treated in the same way as errors. For the sake of clarity, we distinguish between the sequencing read R and the original sequence S resulting from recombination, as defined above. For simplicity we ignore insertion and deletion errors, so that R and S are of the same length L.
We define our error model as deviations from the initial recombination event (through sequencing errors or somatic hypermutations) such that P err (R|S, θ) is the probability of observing the sequencing read R given the recombination product S. Since the recombination scenario E completely determines S, P err (R|S, θ) = P err (R|E, θ), and we use these two notations interchangeably. The dependence on θ reflects the fact that θ also includes the parameters of the error or hypermutation model.
We write the joint probability of producing a given sequence S and observing a given read R as: P (R, S|θ) = P gen (S|θ)P err (R|S, θ).
Summing over all possible recombination products, the likelihood of a sequencing read is: and the total likelihood of the model given a dataset of reads (R 1 , . . . , R N ) is given by: B Expectation-maximization

General scheme
The recombination machinery is degenerate, as several scenarios of recombination and hypermutations can lead to the same sequence, and the recombination scenario E from which the sequencing read R comes from is in general unknown. The Expectation-Maximization algorithm is a commonly used algorithm that maximizes the likelihood of models with hidden variables given the data. In this section we re-derive this algorithm for our class of models.
The procedure is iterative. Starting from an initial set of parameters θ, one wishes to update another set of parameters θ . From Bayes formula, P read (R|θ ) = P (E, R|θ )/P (E|R, θ ), we rewrite the log-likelihood of a read as: where we have used E P (E|R, θ) = 1, and where we have defined The difference between the log-likelihood, ln L total (θ) = N a=1 ln P read (R|θ), between the current set of parameters θ and the candidate new parameters θ reads: where Q(θ |θ) = N a=1 q(θ |θ, R a ), and where we have used Gibbs inequality: This inequality ensures that maximizing the "pseudo-log-likelihood" Q(θ |θ) over θ increases total likelihood by at least the same amount. The Expectation-Maximization scheme updates θ by doing such a maximization, and repeating the procedure iteratively. The algorithm converges to a maximum of the likelihood.
In order to maximize the pseudo-log-likelihood of the recombination model we need to maximize Q recomb (θ |θ) with respect to every model component contained in the parameter set θ , P (E i |{E j } j∈Pi ). We impose normalization using Lagrange multipliers, λ i , and define: Taking the functional derivative ofQ recomb (θ * |θ) with respect to the model parameter we get: Setting this derivative to zero gives: where the Lagrange parameter λ i = N ensures normalization. In other words the modified log-likelihood is maximized by using an update rule that equates the probability of a realization of a recombination event to its posterior frequency.
3 Optimizing the independent single nucleotide error model The independent single nucleotide error model is the simplest instance of an error model, where each nucleotide of the read has a probability r to be mis-sequenced as one of the three other nucleotides with equal probability. For this model we have where N err (R, S) the number of mismatches between R and S, and L the number of error-prone base pairs. We compute the derivative of the modified log-likelihood of the error model with respect to R * as: Setting this derivative to zero yields: where L(R a , E) is the number of potentially erroneous nucleotides in read a. For simplicity we ignore errors and hypermutations in the insertion part of the sequence, as they are almost indistinguishable from unmutated random insertions, and accounting for them would imply summing over an exponentially large number of scenarios. As a result, L in the above formula is not the read length, but rather the number of genomic nucleotides in each scenario, which depends on the scenario E as well as on the sequence read.

Optimizing the hypermutation model
The hypermutation model assumes the following form for the probability of hypermutations: with where (π −m , . . . , π m ) = (S x−m , . . . , S x+m ) is the sequence context of the original recombination product around a hypermutation at position x. The parameters e i (N ), the position-weight matrix, and µ, the overall mutation rate, are part of the parameter set θ. In order to lift the degeneracy of the model we impose that N =A,C,G,T e i (N ) = 0 at every position i. The pseudo-log-likelihood of the hypermutation model reads: where r (S, x) = r (S x−m , . . . , S x+m ) = µ exp m i=−m e i (S x+i ) . It can be rewritten as: where During the Expectation step, we compute these two quantities for each (2m+1)-mer and then maximize Q err at each step of the Expectation-Maximization scheme using Newton's method with a backtracking line search. To impose σ e i (σ) = 0 we remove one parameter per position i by setting for one nucleotide, e i (N ) = − σ =N e i (σ). We can then compute the entries of the gradient vector J (of size 3(2m + 1) + 1): along with the Hessian matrix H entries: For each step of Newton's method we find the step direction by solving H∆θ = −J and we gradually refine the step size based on the Armijo-Goldstein condition. These operations are iteratively repeated until the pseudo-log-likelihood of the error model for a given Maximization step of the EM framework is maximized.

C Model entropy and D KL
Shannon's entropy [31,32], is a measure of the uncertainty about the outcome of a stochastic process described by a variable x, governed by the distribution p(x|θ) and parametrized by θ. As in [15][16][17] we compute this quantity based on our probabilistic framework and use it as an estimate for the diversity generated by the V(D)J recombination process. In the main text we also introduced the relative entropy or Kullback-Leibler divergence, as a measure of dissimilarity between two probability distributions parametrized by θ 1 and θ 2 respectively, and used it to quantify the error made by our probabilistic framework upon inferring the V(D)J recombination parameters. Since both the entropy and the Kullback Leibler divergence between two recombination models can be computed once one knows how to compute the cross entropy H(θ 1 , θ 2 ) = x p(x|θ 1 ) ln p(x|θ 2 ) between the distributions for the two sets of parameters θ 1 and θ 2 , we focus here on the computation of H(θ 1 , θ 2 ).

General form
For the considered class of models, the cross-entropy can be divided into subparts for each model component, with To calculate this sum, one does not need to sum over all possible scenarios E, but only on combinations of processes that affect E i directly or indirectly. Let us call A i ⊂ {1, . . . , K} the set of indices affecting process i. These are defined as the "ancestors" of i in the acyclic graph, i.e. indices j such that there exists a lineage from j to i, (i 1 = i, i 2 , . . . , i k = j) with i +1 ∈ P i (note that A i includes i itself as a 0th order ancestor). Then the previous sum can be reduced to a sum over the processes in A only: where E Ai denotes the subvector of elements of E with indices in A. Estimating the cross entropy for an event E i requires exponential time in the number of ancestors of that node. Fortunately, in the recombination models considered in this paper the set of ancestors are small and obtaining the cross entropy is easy for every event. The special case of insertions is discussed below. Note that this cross-entropy only takes into account the recombination model, and not the error model.

D Probability of generation
Although the probability of generation of a sequence without errors or hypermutations is well defined, computing the probability of generation of a mutated sequence, before mutations occurred, is strictly speaking not possible because that sequence is not know with certainty. However, we can compute a good approximation for it, and we can also calculate its distribution across sequences.
To approximate P gen (S) from a noisy or hypermutated sequence R, we take its geometric average weighted by the probability of the recombination product S: with P (E|R, θ) = P recomb (E, θ)P err (R|Ŝ(E), θ)/P read (R, θ). Alternatively, one can take the generation probability of the most likely recombination product: where S * = argmax S P (S|R, θ).
The distribution ρ(x) of the log-probabilities of generation, x = log P gen , can be computed from data using: Note that unlike estimates for single sequences, this expression should become exact in the limit of N → ∞.

E Data and software
1 Genomic templates We used custom genomic templates derived from the IMGT database [33]. TCR alpha V and J genomic templates were taken from the IMGT human database. For TCR beta V, D and J genes we used curated genomic templates from [15]. BCR heavy chain V, D and J genes were taken from the customized genomic templates used in [16]. For software comparison we used default genomic templates provided with Partis and MiXCR.

Alignments
Initial alignments to germline genes were performed using the Smith-Waterman algorithm [29], with scores of 5 for matching base pairs, -14 for mismatches, and a 50 gap penalty. Alignments with a score below the following gene dependent threshold were discarded: 50 for TRBV, 0 for TRBD, 10 for TRBJ, 20 for TRAV, 10 for TRAJ, 50 for IGHV, 40 for IGHD, 10 for IGHJ. We also discarded alignments whose score fell below the maximum alignment score (found for this read and segment type), minus the following variable range: 55 for TRBV, 35 for TRBD, 10 for TRBJ, 55 for IGHV, 20 for IGHJ.
The alignment offset (the index of the nucleotide on the read to which the first letter of the undeleted genomic template is aligned) was constrained depending on known primer locations on the J gene.

Pruning the tree of scenarios
Since enumerating all possible scenarios for each sequence is not tractable, we used a heuristic method for reducing their numbers. Exploring all possible scenarios is equivalent to exploring all the terminal leafs of a tree. Our heuristic is to prune all branches that do not contribute substantially to the likelihood of the read. To do this we implement a Sparse Expectation Maximization algorithm as motivated in [34]. Due to the acyclicity of the directed graph underlining the Bayesian network, there exists a topological sorting of the events constituting a partially ordered set (we will assume in the following that the indices of the different events E i respect this ordering). IGoR processes event realizations according to this order corresponding to different layers of depth in the tree. To discard irrelevant branches (containing negligible scenarios) IGoR computes at each depth k (with 0 ≤ k < K) an upper bound on the probability of the currently explored scenario: where E is the set of already fully explored scenarios, and 0 ≤ ε ≤ 1 is a tunable parameter setting the precision of the sparsity approximation. While ε = 0 will explore every possible scenario and perform an exact Expectation step, ε = 1 will explore only scenarios more likely that any scenario already explored. Although Eq. 47 captures the essence behind our tree pruning approach, in practice IGoR uses more information than a simple upper probability bound. By picking two gene choice realizations, imposing the identity and position of these specific V and J genes, we explicitly impose the total nucleotide length of event realizations between those V and J genes (number of insertions, deletions, D gene length, ...). When computing the probability upper-bounds IGoR computes the upper probability bound for a given junction length between two event realizations, and uses this refined bound to efficiently prune the tree of scenarios.

Generating synthetic sequences
Synthetic sequences are generated by randomly drawing scenarios of recombination from the probability distribution in Eq. 4 or 5. In order to fit the data, the resulting sequences are then cut to mimic the sequencing process (e.g. fixed starting point and fixed read length). We benchmarked IGoR's performance for evaluating possible recombination scenarios on real data sequences used to infer the models presented in the main text. We used 60bp TCR β sequences for benchmarking since the difficulty for finding the correct V and J for alignment is higher. Finding the Most Likely Scenario Only(MLSO) is on average 3× faster than evaluating all possible scenarios. Restricting possible scenarios to deterministically assigned V and J genes is on average 6× faster(data not shown).

C.
FIG. S4: Gene usage in TRB mRNA vs DNA data. We plot the marginal gene usage averaged over conditional dependencies for V, D and J genes respectively inferred using IGoR from mRNA 100bp (red) and DNA 60bp (blue) technology datasets. We observe a higher inter-method than inter-individual variability.  Fig. 4b for 30000 60bp TCRs. This figure shows the distribution of the number of scenarios that need to be enumerated (from most to least likely) to include the true scenario with 50% (blue), 75% (green), 90% (red), or 95% (cyan) confidence. The shorter read length compared to 130bp BCRs entail a higher uncertainty on the V gene identity, for which a higher number of scenarios must be considered.  Fig. 4c the ability of MiXCR, Partis and IGoR to predict the correct scenario of recombination. Since Partis does not model palindromic insertions we here present the performance of the three software one sequences that were generated without any. Although Partis' prediction is improved and reaches 7.4% close to MiXCR's 9.8% accuracy, both remain lower than IGoR's 26.5% correct predictions.  MiXCR performing deterministic alignments and Partis Viterbi learning we used the output assignments to compute the corresponding recombination statistics. We plot them along with IGoR's distribution obtained from our maximum likelihood approach. Note that for ease of presentation we show distributions averaged over conditional dependences. From the two top panels we observe that Partis and MiXCR overestimate the frequency of low number of non templated insertions. Gene usage is mostly consistent between methods. In the four bottom panels, negative number of deletions denote palindromic insertions. We observe that the three methods obtain qualitatively different marginal distribution of number of deletions. FIG. S10: Inference of the 7mer hypermutation model on synthetic sequences. In order to assess the validity of our method we generate synthetic BCRs sequences from a heavy chain model learned on naive data sequences. We then generate Poisson distributed errors on the sequences by simulating mutations at each base pair by a Bernouilli process according to the hypermutation model learned on the V gene of memory sequences. We then cut the sequences in 130bp reads in order to mimic real data sequences. The results of this experiment shows that the model can be perfectly inferred on V and D genes while the scatter on J gene is higher. This can be explained by the limited number of n-mers observed on J gene since sequences were cut to mimic sequencing from a primer in the J. FIG. S11: Prediction of the mutation frequencies on real data. By direct exploration of recombination scenarios we recorded the posterior mutation frequency per individual base pairs on V and J genomic templates and compare it to the independent 7-mer model. We plot a scatter for base pairs that have been observed at least 2000 times on a 100 000 sequences dataset, for which we can compute a reliable mutation frequency, and the mutation frequency predicted by our model. The two top panels show good predictive power for the gene on which the model was learned. However the two bottom panels show a lesser ability to predict the correct mutation frequencies on the whole locus, hence suggesting that differences observed in inferred position weight matrices (Fig. S13) are of biological relevance.
-0.  FIG. S14: Sequence probability of generation estimation By generating synthetic 130bp BCR sequences from an inferred recombination model without errors we were able to compute their probability of generation Pgen (see SI IV A 3). We further introduced errors in those sequences, errors whose statistics correspond to an inferred hypermutation model and computed an estimate for the probability of generation of the unmutated ancestor. We propose two different estimators: Pgen a geometric average of putative ancestors probability of generation weighted by it's posterior probability (green and middle) and Pgen(argmax S P (S|r)) the probability of generation of the most likely ancestor (pink and bottom). Note that due to convergent recombination the most likely ancestor does not necessarily correspond to the sequence implied by the most likely scenario. Thus these two estimates can only be made thanks to direct exploration of recombination scenarios. Both estimators show almost perfect correlation despite the error distribution of most likely ancestor probability of generation being non symmetric.  Fig. S14's caption. The inferred density (blue) is a histogram of each sequence putative ancestors probability of generation weighted by it's posterior probability. We also plot the distribution of sequence likelihoods, that could be obtained by other methods (e.g forward algorithm) and show that it greatly differs from the distribution of generation probability.