Humanization of Antibodies using a Statistical Inference Approach

Antibody humanization is a key step in the preclinical phase of the development of therapeutic antibodies, originally developed and tested in non-human models (most typically, in mouse). The standard technique of Complementarity-Determining Regions (CDR) grafting into human Framework Regions of germline sequences has some important drawbacks, in that the resulting sequences often need further back-mutations to ensure functionality and/or stability. Here we propose a new method to characterize the statistical distribution of the sequences of the variable regions of human antibodies, that takes into account phenotypical correlations between pairs of residues, both within and between chains. We define a “humanness score” of a sequence, comparing its performance in distinguishing human from murine sequences, with that of some alternative scores in the literature. We also compare the score with the experimental immunogenicity of clinically used antibodies. Finally, we use the humanness score as an optimization function and perform a search in the sequence space, starting from different murine sequences and keeping the CDR regions unchanged. Our results show that our humanness score outperforms other methods in sequence classification, and the optimization protocol is able to generate humanized sequences that are recognized as human by standard homology modelling tools.

For both human and mouse sequences, we also consider separately the VH and VL databases obtained by splitting the combined databases into their VH and VL parts (and removing repeated VH or VL chains, if necessary).
Notice that we use the mouse learning database only for the classification with two reference distributions, while the rest of our results are based on just the human learning dataset.

Test databases
We download from the DIGIT server 3 the whole database of matching human VH/VL sequences (3322 sequences), aligned according to the Kabat scheme. We remove the alignment, split the sequences into separate VH and VL chains, and filter on their length as above. We perform the same steps for the database of murine VH/VL matching sequences (1933 units.) Then, we use ANARCI to align the human VH and VL files according to the AHo scheme, and eliminate repeated sequences both within the DIGIT VH or VL files, as well as between these and the corresponding murine DIGIT aligned files, and the aligned human and mouse learning databases. Finally, we combine the VH and VL sequences, obtaining a database of 1388 sequences.
Analogously, we use ANARCI to align the murine VH and VL files according to the AHo scheme, eliminating repeated sequences both within the DIGIT VH or VL files, and between these and the corresponding human DIGIT aligned files, and the aligned human and mouse learning databases. Finally, we combine the VH and VL sequences in a database of 1379 sequences.
Again, for both human and mouse sequences, we also consider the VH and VL databases obtained by splitting the combined databases.

Classifiers based on inference of a probabilistic model
We follow Baldassi et al. 11 and infer a multivariate gaussian distribution from each VH, VL and combined VHVL learning databases, using uninformative prior distributions. Then, we calculate the posterior predictive distribution, that results to be a multivariate Student distribution, and use it to score the sequences in the test datasets. We start by mapping the L-residues-long, aligned sequences of the database (made up by M sequences, and drawn from a Q = 20 letters alphabet) to a binary sequence of N = QL bits {x i = 0, 1, i = 1, . . . , N}, that, in block of Q bits, represent all the amino acids. Namely, residue of type a = 1, . . . , Q at position k = 1, . . . , L of a given sequence, is mapped to a binary variable x i=(k−1)Q+a . In principle, in any block of Q bits only one bit (or none, to represent a gap) can be set to 1, to prevent the coexistence of more than one amino acid at a site. In practice, following Baldassi et al. 11 , we do not impose such constraint, letting the system itself implementing it through anticorrelations; we also introduce weights w m for each sequence in the alignment, in order to calculate the empirical average and covariance: with M e = ∑ M m=1 w m . The weights are calculated as w m = 1/n m , where: is the number of sequences whose similarity S lm with sequence m exceeds a threshold LΩ. Here θ (•) is the step function; the similarity is defined in terms of the number of identical residues (excluding gaps): i indicates the amino acid type at position i of sequence m and δ X,Y is the Kronecker delta. At difference from the choice in 11 , the threshold Ω is defined as the value that maximizes the Frobenius norm of the resultingC i j .
As in Ref. 11 , we assume that each of the M sequences in the database, . . , N}, is drawn from a normal distribution with parameters µ, Σ (thus promoting x m i to be real numbers): with |Σ| indicating the determinant of the N × N matrix Σ. In order to infer the best values of µ, Σ from the known database of human sequences X = {x m , m = 1 . . . , M}, we also assume a Normal Inverse Wishart prior distribution for the parameters µ, Σ: Using Bayes theorem, the posterior distribution for µ, Σ, given the data X can be calculated, yielding again a NIW distribution with new parameters η , κ , Λ , ν : where with The mean of µ and Σ with the posterior NIW distribution are given by

4/10
while the mode of the distribution is achieved for µ = η , Σ = Λ /(ν − N + 1). In a frequentist approach, we could fix µ, Σ to their modes, and use Eq.(4) to draw new (humanized) sequences from the distribution. However in a Bayesian approach, we recognize that we do not know the "correct value", but just the distribution of µ, Σ, so we derive the posterior predictive distribution for each new sequence y = {y i , i = 1, . . . , N}, by integrating on µ, Σ the joint probability: p(y, µ, Σ|X) = p(y|µ, Σ) p post (µ, Σ|X). Plugging Eqs. (4), (6) into the above equation, we get ). Due to the normalization of the NIW distribution to 1, upon integrating Eq. (10) on µ, Σ, and using the Sylvester determinant identity |I m + AB| = |I n + BA| (where A is a m × n and B an n × m matrix), we get the posterior predictive distribution of a new sequence y given the database of sequences X: where we have introduced the multivariate t-distribution probability density: and we have used Eq. 9 to eliminate η , Λ . Finally, using Eqs. (7), (8) and choosing, as in Ref. 11 , ν = N + κ + 1, Σ prior ≡ Λ ν−N−1 = U, and the prior estimates η and U as those corresponding to the mean and covariance estimates of a uniformly distributed sample, we get: with: We use the logarithm of p(y|X) in Eq. (14) as a score of the humanness of any given sequence y, and call it the "MG score". Notice that p(y|X) is a probability density, and not a probability: as such, it is not bound between 0 and 1, and actually, due to its strong localization in the high dimensional sequence space, it will greatly exceed 1.
Finally, we have to choose a value for λ to be used in our inference, to set the best amount of regularization λU that should be added to the empirical covariance to optimize the statistical model.
We do so by analyzing the different ROC curves, obtained at different values of λ in classifying the test databases (see the previous section for the definition of the ROC curve), and choosing the value of λ yielding the curve with the maximal area under it. Finally, for the classification with one reference distribution, we select, as the threshold score, the one corresponding to the point, on the ROC curve for the test database, with the highest value of the Youden's coefficient. Notice that the threshold score identified in this way is maintained when analyzing the therapeutic antibodies, and in the humanization protocol.
When classifying with two distributions, we fix the λ for both the murine and human statistical models as explained above. Then, we simply score each query sequence in the test databases with both statistical models, classifying it as human or murine depending on the which of the two scores is higher.

5/10
Correlation-neglecting classification Classification without correlations is performed by maintaining the same value of Ω and asking that Σ i j = 0 if k = l, where i = (k − 1)Q + a, j = (l − 1)Q + b, k, l = 1, . . . , L, a, b = 1, . . . , Q, i.e. neglecting correlations between blocks of binary variables representing residues at different positions along the sequence; this ensures that Σ −1 will be block-diagonal as well, with no interactions between residues. This also yields that we can keep only the terms corresponding to the block diagonal terms in the matrices multiplying Σ −1 , due to the fact that the trace in the argument of the exponential in, for instance, Eq. (10) will be the same. Thus, a part from a different normalization in front of Eq. 14, the expressions Eqs. (14), (15) will still be valid, with block-diagonal matrices. We optimize λ as before, obtaining λ = 0.027.
Analogously, classification without correlations between VH and VL regions is performed by maintaining the same value of Ω and asking that Σ i j = 0 if k,l belong to the VH and VL region, respectively, i.e., i = (k − 1)Q + a, j = (l − 1)Q + b, k, l = 1, . . . , L/2, a, b = L/2 + 1, . . . , Q (since in AHo numbering scheme, the alignments of VH and VL regions have the same length). Again, the resulting Σ −1 will be block-diagonal, with no interactions between residues belonging to different variable regions. The value of λ in this case is λ = 0.067.

Definition of an Immunogenic Score
We run the MHCII program (IEDB MHC II, version 2.16.2: http://tools.iedb.org/mhcii/download/) 12, 13 mhc II binding.py with the "IEDB recommended" flag. Such program provides in output a list of 15-residues-long peptides, with several scores associated to them: we decide to use the percentile score as a reference, as it is a consensus score of different methods. According to the instructions, low percentile scores correspond to peptides that are more easily bound and recognized by T-cells, eliciting immunogenic response. However, there is not a clear rule to state when an antibody sequence will cause such response. We tried several quantities to define an immunogenic score: average percentile scores of the k lowest scoring peptides, number of alleles recognizing a peptide with percentile score below a certain threshold x 0 , number of peptides within the antibody with a percentile score below a certain threshold x 0 ; we finally decide for the latter, which is the one that best distinguish human from murine sequences of the learning databases, according to the ROC curves reported in Table S2. Given a threshold number s of peptides with percentile score less than x 0 , we define as "immunogenic" an antibody that presents more than s peptides below the percentile threshold x 0 , and will consider as True Positives the murine sequences that are "immunogenic", False Positives the human sequences that are "immunogenic", and so on. Such definitions allow to calculate False Positive Rates and True Positive Rates for each value of s: upon varying "s", we obtain a ROC curve, for any given value of the percentile threshold x 0 .
Notably, the value x 0 = 0.31 corresponds both to the highest AUC (0.698203), and to the highest maximal values of Matthews Correlation Coefficient (MCC=0.294284, obtained for s = 23), while Youden's J Statistic takes its maximal value at x 0 = 0.30 (s = 15, YJS=0.305817). We decide to take x 0 = 0.31 as the working value that best allows to distinguish human from murine. So, it will be considered as immunogenic as a mouse sequence, a VHVL sequence presenting globally p ≥ 23 short peptides (of length 15) with a consensus percentile score x ≤ 0.31.   (s, b) where s is the value of the score (either MG, T all , T 20 or T 1 ) yielding the maximal value of the MCC or YJS, and b is the maximal value itself. We report T 20 in the comparison as it is the value proposed in Ref. 14 Figure S2. ROC curves for the immunogenic score, calculated on the human and murine learning databases. The ROC curves are calculated for different values of the percentile score below which peptides are considered immunogenic. Each curve is obtained by reporting TPR vs FPR at varying value of the the threshold number of peptides, per sequence, that are found below the threshold. That is, at a given value of the percentile score x 0 , we assume that a sequence containing more than n 0 peptides with MHCII percentile score less than x 0 will be immunogenic, and calculate the FPR and TPR values at varying n 0 : we consider a murine antibody with more than n 0 peptides with a percentile score below x 0 , as a True Positive; a True Negative will be a human antibody with less than n 0 peptides below n 0 . We observe that the immunogenic score is not very efficient in distinguishing human from murine sequences: the area under the ROC curve is just AUC = 0.698 in the best case.  Figure S4. MG-score (left y-axis) and Hamming distance (right y-axis) for both the Steepest Descent (left) and the Simulated Annealing Monte Carlo (right) for our first murine target. The Hamming distance of the sequence at each step ("seq") to both the original ("orig"), murine sequence and the experimentally humanized one ("hum") are reported; color codes are the same in both panels. Table S4. BLAST results on simulated and experimentally humanized sequences. We run protein-BLAST 15 separately on the VH and VL regions of the optimal SD sequences and the experimentally humanized ones, and report the most similar sequence found by BLAST: "Seq" indicates the target sequence, "MG" is the MG-score (in parentheses, the MG-score of the original murine sequence), and the rest of the columns report the output from protein-BLAST, run with the standard parameters, about the most similar sequence to the query one: the e-value is the measure of the quality of the identification (the lower, the better), the % identity refers to the sequence homology between the query and the best retrieved sequence, identified by the description and the accession number. The target sequence is identified as in the following examples: 1 EH H: H chain of the Experimentally Humanized (EH) sequence of target 1; 1 SD L: L chain of our SD Humanized sequence of target 1.