Modeling the Amplification of Immunoglobulins through Machine Learning on Sequence-Specific Features

Successful primer design for polymerase chain reaction (PCR) hinges on the ability to identify primers that efficiently amplify template sequences. Here, we generated a novel Taq PCR data set that reports the amplification status for pairs of primers and templates from a reference set of 47 immunoglobulin heavy chain variable sequences and 20 primers. Using logistic regression, we developed TMM, a model for predicting whether a primer amplifies a template given their nucleotide sequences. The model suggests that the free energy of annealing, ΔG, is the key driver of amplification (p = 7.35e-12) and that 3′ mismatches should be considered in dependence on ΔG and the mismatch closest to the 3′ terminus (p = 1.67e-05). We validated TMM by comparing its estimates with those from the thermodynamic model of DECIPHER (DE) and a model based solely on the free energy of annealing (FE). TMM outperformed the other approaches in terms of the area under the receiver operating characteristic curve (TMM: 0.953, FE: 0.941, DE: 0.896). TMM can improve primer design and is freely available via openPrimeR (http://openPrimeR.mpi-inf.mpg.de).


Results
Having selected 908 PTPs from the PCR data set, we classified the amplification status of each PTP either as Amplified or Unamplified depending on the result of gel electrophoresis (Fig. 1). To investigate which properties of PTPs are associated with the amplification status, we computed their physicochemical properties using open-PrimeR, most notably, the free energy of annealing, ΔG [kcal/mol], and three features related to 3′ mismatches: z ∈ {0, 1} 6 ,  ∈ X N 0 , and i X ∈ {0, 1, …, 6} (Fig. 2). We used these features to train a logistic regression model for predicting the amplification status and validated the model by comparing its performance with that from DECIPHER and an approach relying only on ΔG.   Table 2 shows the relationship between the number of primer-template mismatches, free energy of annealing, and the rate of amplification. In our data set, primers with at most 3 mismatches had a 100% amplification rate. It is noteworthy that even primers binding with as many as 6 mismatches obtained a high amplification rate of 83.3%. Note that, for any given number of mismatches, the primers from Set 2 consistently exhibit a greater rate of amplification than the primers from Set 1. Comparing amplified and unamplified PTPs (Fig. 3), we found that the ΔG IQR of observations labeled as Unamplified was higher and more concentrated ([−2.17 kcal/mol, −1.69 kcal/ mol]) than for those labeled as Amplified ([−12.70 kcal/mol, −5.21 kcal/mol]). Amplified samples generally exhibited fewer mismatches in the 3′ hexamer (X N IQR of [0, 1] vs [2,4]) and particularly fewer mismatches close to the 3′ terminus (i X IQR of [0, 3] vs [5,6]) than unamplified samples. Applying two-sided Wilcoxon rank-sum tests revealed that there is a significant difference between Amplified (N = 382) and Unamplified (N = 526) observations concerning both ΔG (p-value 1.68e-107) and i X (p-value 1.51e-91).
Logistic regression models. We used logistic regression in order to identify the features that are predictive of successful PCR amplification events. Since considered primers shared similar physicochemical properties (Table 3), we only considered properties relating to PTPs when defining the two logistic regression models LR 1 and LR 2 (Table 4). LR 1 was defined using the features z, X N , and ΔG. For LR 2 , a term modeling the 3′ mismatch closest to the 3′ terminus, i X , and a term ΔGi x modeling the interaction of ΔG and i x were additionally included. Since LR 1 was not corrected for the association between ΔG and i X , only z 6 (p = 8.25e-08) and ΔG (p < 2e-16) were found to be significantly predictive of the amplification status. Based on LR 2 , on the other hand, only ΔG (p = 1.78e-11) and ΔGi x (p = 5.12e-05) were found to be significantly predictive of the amplification status. This finding indicates that mismatches within the 3′ hexamer are not independent predictors of the amplification status but dependent on ΔG.
Evaluated models and classifiers. In order to form a generalizable logistic regression model for predicting the likelihood of amplification, features were eliminated by performing backward stepwise selection on a model trained using the features considered in LR 2 . The selection procedure reduced the Akaike Information Criterion (AIC) of the initial logistic regression model from 112.34 to 102.38. Besides the intercept, the following three features were selected: ΔG, i X , and the interaction term ΔGi X . In the following, this logistic regression model is called the thermodynamic mismatch model (TMM).
In order to assess the predictive performance of available approaches for predicting the likelihood of PCR amplification, we considered three models: The model DE from DECIPHER 8 , a model solely based on the free energy (FE), and TMM. Besides evaluating the quantitative output of these approaches, we also evaluated the performance of classifiers corresponding to these models by calculating a cutoff based on the estimates of each model in order to classify PTPs either as Amplified or Unamplified. Two types of cutoffs were selected for each model, one optimized for overall accuracy (by maximizing Youden′s index) and another optimized for specificity (Table 5). Classifiers optimized for overall performance and classifiers optimized for high specificity are denoted by subscription of Y or s, respectively. For example, TMM s denotes the high-specificity TMM classifier and TMM Y denotes the TMM classifier that was optimized for overall performance. www.nature.com/scientificreports www.nature.com/scientificreports/ Comparison of model and classifier performance. Quantitative model responses were compared with the categorical amplification status from gel electrophoresis according to the area under the receiver operating characteristic curve (AUC). TMM achieved the highest AUC (0.953) but was closely followed by FE (0.941), and DE (0.896). For all models, predictive performance was higher for observations from Set 2 than for those from Set 1 ( Table 6). The classifier performance was evaluated with respect to sensitivity, specificity, and the F1 score ( Fig. 4). Among high-performance classifiers, TMM Y had a larger F1 score than DE Y and FE Y (90% vs 88% and 88%). Among high-specificity classifiers, TMM s and DE s outperformed FE s with respect to sensitivity (76% and 78% vs 64%).
Interpretation of the TMM model. For interpreting and deploying TMM, a final model was trained on the full data set. The model can be specified in the following way (Table 7). Let p = Pr(y i = Amplified) denote the probability that a template is amplified. Given ΔG and i X , the model estimates p = Pr(y i = Amplified) according to its coefficients β 0 = −5.62, β 1 = −1.55, β 2 = 0.33 and β 3 = 0.18 in the following way:    The intercept of the model is β 0 = −5.62, which indicates that the odds of template amplification are low if the other terms are negligible (i.e. for ΔG → 0 and i X → 0). The second term, (−1.55 + 0.18 i X )⋅ ΔG, is controlled by the free energy of annealing. For typical negative values of ΔG, the odds of amplification increase with decreasing ΔG because −1.55 + 0.18 i X is always negative since 0 ≤ i X ≤ 6. The presence of 3′ terminal mismatches (i X ≠ 0), however, reduces the odds of amplification. The third term, 0.33 i X , increases the odds if a 3′ mismatch is present (i X ≠ 0). This term can be interpreted as a correction factor, which models that there is an overrepresentation of PTPs with high ΔG (e.g. −5 kcal/mol) and high i X .
The model can be visualized as a cube (Fig. 5) whose three dimensions correspond to ΔG, i X , and the estimated likelihood of amplification, p, for the PTPs in the IGHV data set. For low and high free energies (e.g. at −20 and −5 kcal/mol), ΔG dominates p, while i X influences p mostly at intermediate values of ΔG (e.g. at −10 kcal/mol).   Table 6. Model performance in terms of the AUC when validating models on test set observations from individual primer sets.

Discussion
In this work, we presented a novel PCR data set providing the amplification status for all combinations of 47 IGHV templates and 20 primers. Using these data, we investigated the interplay of the free energy of annealing and the presence of 3′ terminal mismatches and found that both factors should be considered in dependence of each other. Based on this insight, we developed TMM, a logistic regression model for predicting amplification events. In our analysis of the IGHV data, we could mostly confirm the established factors governing the efficiency of PCR. More specifically, we could show that templates whose amplification could not be detected via gel electrophoresis are a result of primer-template conformations exhibiting high free energies, an increase in the number of mismatches within the 3′ hexamer, and a tendency for displaying mismatches close to the 3′ terminus. For the present data, however, we found that terminal mismatches by themselves are not significantly predictive of the amplification status when correcting for their association with the free energy of annealing. This finding suggests that a mismatch at the 3′ terminus does not preclude detection via gel electrophoresis as long as primer and template are otherwise highly complementary.
The newly developed TMM model for predicting amplification events has several advantages over the other models. First, since the model is based only on ΔG and i X , it is easily interpretable and it is unlikely that the model suffers from overfitting. Second, the model estimates the probability of amplification, which is a more intuitive measure than the efficiency of amplification from DE. Third, TMM achieved the largest AUC and its high-specificity classifier achieved the highest sensitivity among all classifiers. Since the present data set contains only primers exhibiting specific properties such as the absence of self-dimers and the presence of a GC clamp (Table 3), TMM neither considers primer-nor template-specific properties. Thus, it is likely that TMM overestimates the likelihood of amplification for primers exhibiting less favorable properties or when templates exhibit secondary structures [27][28][29] . Indeed, a previously described logistic regression model proposed by Yuryev et al. 9 considered a larger number of features than TMM. Their model, however, was developed for primer genotyping   www.nature.com/scientificreports www.nature.com/scientificreports/ assays, which renders it inappropriate for applications where several primer-template mismatches need to be considered.
Overall, all three methods achieved high predictive performances on the IGHV data set. Although the predictive performance of FE Y was surprisingly high, the considerably lower performance of FE s indicates that the free energy of annealing by itself lacks robustness. In contrast to DE, which estimates the efficiency of polymerase elongation according to the impact of position-and base-specific effects in the 3′ region, TMM considers only the position of 3′ mismatches. The following two observations could explain why the consideration of base-specific effects did not provide an advantage over TMM, although their influence is extensively described in the literature. First, none of the primers contained in the IGHV data set displayed terminal nucleotides other than G or C (Table 3). Second, since base-specific differences in amplification efficiencies have only been reported for qPCR 8 , these difference may simply not be observable with data from gel electrophoresis. Additionally, the present data ( Table 2 and Fig. 3) suggest that even simple stringent approaches can be used to ensure high rates of amplification, for example, requiring free energies less than −10 kcal/mol or allowing at most three mismatches.
In order to select a suitable prediction model, its field of application should be carefully deliberated. For example, for multiplex primer design, false positive predictions should be avoided at all costs because they may preclude the amplification of templates that are not redundantly covered. False negative predictions, on the other hand, are much more tolerable. Our analysis suggests that high-specificity classifiers such as TMM s or DE s are most appropriate in this scenario. In multiplex scenarios where it is not necessary to amplify all templates, smaller primer sets can be designed by choosing a model with greater sensitivity.
Although models that estimate the likelihood of amplification should be an integral part of rational primer design approaches, there are few available models for this task. The lack of publicly available PCR data is not only a limiting factor for model development but also for improving our understanding of the molecular characteristics that govern PCR amplification. Only when enough data are available will it be possible to devise more comprehensive models that consider all relevant properties concerning primers, templates, and their interaction. Here, we presented a novel PCR data set on which basis we developed TMM, a model for predicting the PCR amplification status, which is freely available via openPrimeR (http://openprimer.mpi-inf.mpg.de/).

Materials and Methods
Template design and PCR measurements. We cloned 47 heavy chain fragments from naive B cells into pCR4-TOPO-vector backbones. Each fragment comprises a different functional IGHV gene with the complete leader (L) region, the complete V region and a short part of the constant region. The individual V genes served as representative templates for two different IGHV-specific primer sets. Set 1 is a set of 16 forward primers that was recently designed using openPrimeR 25 , while Set 2 consists of 4 forward primers that were described previously 26 . We performed three independent PCR reactions for each of the 20 primers on all 47 templates with the same IgM constant region-specific reverse primer (GGTTGGGGCGGATGCACTCC) 30 . All primers used in the experiments are listed in Table 3. PCRs were performed in 25 µL reactions with 2U/rxn Platinum Taq (Thermofisher), 0.2 µM forward and reverse primer, 0.2 mM dNTPs, 1.5 mM MgCl 2 , and 6% Kb extender under the following cycling conditions: 2 min initial denaturation at 94 °C followed by 25 cycles of 30 s at 94 °C, 30 s at 57 °C (Set 2) was assigned a label y i ∈ {Amplified, Unamplified} based on the evaluation of gel electrophoresis by five persons. Each of the five reviewers visually inspected the gels and independently classified the amplification status. If a band was visible in a gel, the corresponding measurement was labeled as Amplified and otherwise as Unamplified (Fig. 1). The following procedure was used to identify y i,j , the label of PTP i according to reviewer j ∈ {1, …, 5} from a set of triplicate measurements. If at least two of three measurements were labeled as Amplified, y i,j was set to Amplified. Otherwise, y i,j was set to Unamplified. Let n i,A = |{y i,j |yi,j = Amplified}| and n i,U = |{y i,j |y i,j = Unamplified}| indicate the number of times that PTP i was labeled as Amplified or Unamplified, respectively. By setting we labeled PTP i as Amplified only if the majority of reviewers had labeled the PTP as Amplified.
We used openPrimeR to enrich the PCR data with physicochemical properties relating to primers and PTPs. The most likely binding mode for every PTP was identified by selecting the binding conformation minimizing the number of mismatches. Since the exact annealing site of primers is uncertain for PTPs subject to many mismatches, we excluded PTPs with more than 12 mismatches. This reduced the size of the data set from 940 to 908 observations. Based on the determined binding conformation, we derived further properties such as the position of primer-template mismatches. The free energy of annealing ΔG was computed with OligoArrayAux 32 using temperatures of 55 °C and 57 °C for PTPs from Set 1 and Set 2, respectively. Additionally, the following primer-specific properties were computed: primer length, extent of GC clamp, GC ratio, melting temperature, number of repeats/runs, free energy of secondary structures, and self-dimerization.
For model development purposes, we split the data set into three distinct parts (Table 8). To obtain an independent data set for the selection of classifier cutoffs, 25% of the observations were randomly sampled for inclusion in the validation set. We randomly selected 50% of the remaining observations for inclusion in the training data set, which was used for forming a supervised learning model, and the remainder for inclusion in the test data set, which was used for evaluating model performance.
Feature encoding. In order to investigate the impact of 3′ terminal mismatches, we implemented several encodings, which are illustrated in Fig. 2. The mismatch feature vector z ∈ {0, 1} 6 relies on a binary encoding to indicate whether a mismatch was identified at the j-th position in the 3′ hexamer via Here, j ∈ {1, 2, … 6} identifies the 3′ hexamer position such that j = 1 indicates the first position in the 3′ hexamer and j = 6 indicates the 3′ terminal position. To explicitly model the augmenting effect of co-occurring mismatches in the 3′ hexamer 8 , the total number of 3′ hexamer mismatches was encoded as = ∑ X z N j j . Since positions closer to the 3′ terminus deteriorate PCR efficiency to a greater degree 5,6,12-15 , we encoded the 3′ hexamer mismatch closest to the 3′ terminus by setting For example, a primer without 3′ mismatches has i X = 0, while a primer exhibiting mismatches at positions 4 and 6 in the 3′ hexamer has i X = 6.
Logistic regression models. We used multivariate logistic regression models in order to investigate the influence of individual features on the template amplification status. Logistic regression is a commonly used approach for problems with categorical outcomes. In this case, we would like to estimate the amplification status y i ∈ {Amplified, Unamplified }. Let p = Pr(y i = Amplified) denote the probability that a template is amplified and let p indicate the corresponding estimated likelihood. Further, let β 0 indicate the model intercept and let β i with  ∈ i indicate the weight associated with the i-th feature x i . Then the logistic regression model can be formulated as   Due to the small number of evaluated primers, only terms relating to PTPs were considered as features for the logistic regression models. The logistic regression models LR 1 and LR 2 were used for studying feature importance. While LR 1 was defined using the mismatch feature vector z ∈ {0, 1} 6 , the number of mismatches in the 3′ hexamer (X N ), and the free energy of annealing ΔG, LR 2 additionally included the terms i x and ΔGi X in order to correct for the association between ΔG and i X .
For the definition of a logistic regression model estimating the probability of amplification, we formulated TMM by performing feature selection using backward stepwise selection. This process was guided by the AIC 33 , which is defined as where k is the number of model parameters and L indicates the maximum value of the likelihood function. Starting from a model trained on the LR 2 features in the validation set, variables were iteratively eliminated in order to minimize the AIC, thereby ensuring that the final model obtains the best possible fit at the lowest possible complexity.