Deep learning-based rapid generation of broadly reactive antibodies against SARS-CoV-2 and its Omicron variant

Dear Editor,

framework for the first part. We found that ACNN outperforms the other two networks, with significantly higher accuracy and recall (Supplementary information, Tables S3, S4). A more detailed evaluation of the Precision-Recall Curve (PRC) and the Receiver Operating Characteristic (ROC) curve also reveals a clear advantage for ACNN, which achieves a better trade-off between precision and recall (Fig. 1a; Supplementary information, Fig. S1b).
In the featurization step, each amino acid of the heavy chain, light chain and antigen is encoded as a one-hot vector $h_i, l_i, a_i \in \{0,1\}^{d}$, and the vectors are then concatenated into three sequence matrices (feature maps) $H := [h_1 \cdots h_N]$, $L := [l_1 \cdots l_N]$, $A := [a_1 \cdots a_N] \in \{0,1\}^{d \times N}$, where $N$ denotes the standardized length of the input encoded sequences with zero-padding.
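The featurization step can be sketched as follows; the amino acid alphabet, alphabet size, and example sequence are illustrative assumptions, not the published encoding.

```python
# Sketch of the featurization step: one-hot encoding with zero-padding.
# Alphabet, padding scheme and the example sequence are assumptions.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard amino acids
AA_INDEX = {a: i for i, a in enumerate(AA)}
N_MAX = 300                           # standardized input length

def one_hot(seq: str, n_max: int = N_MAX) -> np.ndarray:
    """Encode a sequence as an (n_max, 20) binary matrix; short sequences are zero-padded."""
    mat = np.zeros((n_max, len(AA)), dtype=np.float32)
    for i, aa in enumerate(seq[:n_max]):
        mat[i, AA_INDEX[aa]] = 1.0
    return mat

heavy = one_hot("QVQLVQSG")   # hypothetical VH fragment
print(heavy.shape)            # (300, 20)
```

Each row carries exactly one set bit for an encoded residue and all zeros in the padded tail, so the matrix doubles as a length mask.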
In the feature extraction step, a three-branch weight-sharing ACNN is used to extract the intra-sequence features and process each input into a feature map in a latent space:

$$F_H = \mathrm{ACNN}(H), \quad F_L = \mathrm{ACNN}(L), \quad F_A = \mathrm{ACNN}(A),$$

where $F_H, F_L, F_A \in \mathbb{R}^{C \times N}$ denote the processed feature maps and $C$ denotes the number of feature channels.
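A single shared branch of this kind might look as below in Keras; the channel count, kernel size and dilation (atrous) rates are illustrative assumptions, not the published architecture.

```python
# Minimal sketch of one weight-sharing ACNN branch in Keras.
# Layer sizes and atrous (dilation) rates are assumed for illustration.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

N_MAX, D_IN, C = 300, 20, 64   # sequence length, one-hot dim, feature channels

def make_acnn_branch() -> keras.Model:
    inp = layers.Input(shape=(N_MAX, D_IN))
    x = inp
    for rate in (1, 2, 4):      # stacked atrous convolutions widen the receptive field
        x = layers.Conv1D(C, kernel_size=3, dilation_rate=rate,
                          padding="same", activation="relu")(x)
    return keras.Model(inp, x)

branch = make_acnn_branch()     # one instance, applied to heavy, light and antigen
fh = branch(np.zeros((1, N_MAX, D_IN), dtype=np.float32))
print(fh.shape)                 # (1, 300, 64): resolution preserved, no down-sampling
```

Because the same `branch` object is called on all three inputs, its weights are shared across the heavy-chain, light-chain and antigen sequences.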
In the feature selection step, the feature maps are filtered via global max-pooling over each feature channel dimension to select the features $f_H, f_L, f_A \in \mathbb{R}^{C}$:

$$f_H = \mathrm{pool}(F_H), \quad f_L = \mathrm{pool}(F_L), \quad f_A = \mathrm{pool}(F_A).$$

Finally, the selected features are concatenated and fed into a residual-structured MLP to process the inter-sequence information and predict a probability:

$$p = \mathrm{MLP}([f_H; f_L; f_A]),$$

where $p \in (0,1)$ is the output of a sigmoid activation function at the last layer of the MLP, estimating the binding probability.
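The pooling-and-prediction head can be sketched as below; a plain dense layer stands in for the residual-structured MLP, and all sizes are assumptions.

```python
# Sketch of feature selection and prediction: channel-wise global max-pooling,
# concatenation, and a sigmoid MLP head. Sizes and the plain (non-residual)
# MLP are illustrative assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

N_MAX, C = 300, 64

f_h, f_l, f_a = (Input(shape=(N_MAX, C)) for _ in range(3))  # branch feature maps
pooled = [layers.GlobalMaxPooling1D()(f) for f in (f_h, f_l, f_a)]  # one max per channel
z = layers.Concatenate()(pooled)                 # 3*C selected features
z = layers.Dense(128, activation="relu")(z)
p = layers.Dense(1, activation="sigmoid")(z)     # binding probability in (0, 1)
head = Model([f_h, f_l, f_a], p)

x = [np.zeros((2, N_MAX, C), dtype=np.float32)] * 3
probs = head.predict(x, verbose=0)
print(probs.shape)            # (2, 1)
```

Global max-pooling reduces each $(N, C)$ feature map to a length-$C$ vector, which is what makes the concatenated input to the MLP independent of sequence length.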
Compared with a CNN, the ACNN utilizes a series of atrous convolution layers, with the atrous rates computationally optimized to extend the receptive field without the loss of resolution that results from down-sampling. The Atrous-III block is thus able to achieve a broader field of view at higher resolution than the Res-conv block, providing more comprehensive and informative features (Supplementary information, Fig. S4). The Transformer employs an attention mechanism for a global-range receptive field, but it also requires positional encoding to preserve locational information and a larger dataset to fulfil its potential.
A 5-fold cross-validation was used to compare the three modules, as recorded in Supplementary information, Table S3. Because the sequence lengths of the heavy chain, light chain and antigen RBD are all under 300, $N$ was set to 300 in our experiments; it can be increased if required.
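A minimal sketch of the 5-fold split, using scikit-learn on placeholder data (the arrays below are random stand-ins, not the paper's dataset):

```python
# Sketch of 5-fold cross-validation for comparing model backbones.
# X and y are random placeholders, purely for illustration.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((100, 10))          # placeholder paired-sequence features
y = rng.integers(0, 2, size=100)   # placeholder binding labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for train_idx, val_idx in kf.split(X):
    # a model would be trained on X[train_idx] and evaluated on X[val_idx]
    fold_sizes.append(len(val_idx))
print(fold_sizes)                  # [20, 20, 20, 20, 20]
```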
The networks were implemented in Python using the TensorFlow and Keras packages. Each model was trained end-to-end using the Adam optimizer for 800 epochs, with a batch size of 24 and an initial learning rate of 0.001.

Evaluation criteria using Receiver Operating Characteristics
The criteria include recall (or sensitivity, hit rate, true positive rate), selectivity (Selec., or specificity, true negative rate), precision (Prec., or positive predictive value), negative predictive value (NPV), accuracy (Acc.) and balanced accuracy (B-Acc.):

$$\mathrm{Recall} = \frac{TP}{TP+FN}, \quad \mathrm{Selec.} = \frac{TN}{TN+FP}, \quad \mathrm{Prec.} = \frac{TP}{TP+FP},$$
$$\mathrm{NPV} = \frac{TN}{TN+FN}, \quad \mathrm{Acc.} = \frac{TP+TN}{TP+TN+FP+FN}, \quad \text{B-Acc.} = \frac{\mathrm{Recall}+\mathrm{Selec.}}{2}.$$

The PRC is also used to further illustrate the capability of binding prediction, balanced between positive and negative (Supplementary information, Fig. S1b). The ROC is a curve of recall against selectivity plotted with a varying threshold on the prediction probability; it can thus demonstrate the capability of binding prediction under varying thresholding conditions on a positive-negative imbalanced validation dataset. The prediction score of the model gives the probability that a paired heavy and light chain binds to a given antigen, not the affinity of the antibody. Therefore, there is no correlation between the ELISA OD and the prediction score (Supplementary information, Fig. S1c).
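These criteria follow directly from the confusion matrix; the sketch below computes them on small made-up labels and scores (scikit-learn is used only for the ROC AUC).

```python
# Standard confusion-matrix criteria on toy labels and scores.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.05])
y_pred = (y_prob > 0.5).astype(int)

tp = int(((y_pred == 1) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

recall = tp / (tp + fn)                 # sensitivity / true positive rate
selectivity = tn / (tn + fp)            # specificity / true negative rate
precision = tp / (tp + fp)
npv = tn / (tn + fn)
accuracy = (tp + tn) / len(y_true)
balanced_accuracy = (recall + selectivity) / 2
auc = roc_auc_score(y_true, y_prob)     # area under the ROC curve
print(recall, selectivity, balanced_accuracy)
```

Sweeping the 0.5 threshold across (0, 1) and re-computing recall and selectivity at each value traces out the ROC curve described above.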

The ACNN-based XBCR-net achieved a high selectivity (92%) on the validation dataset, indicating a low false-positive prediction rate. We therefore hypothesized that XBCR-net could predict antigen-specific BCRs from scBCR-seq data. To test the accuracy of our model on scBCR-seq data, we performed prediction using single-cell data from the GEO database (GSE171703) that were not covered by our training and validation datasets, as outlined in Supplementary information, Data S2.
First, the amino acid sequences of the paired VH and VL were encoded as input for XBCR-net. These sequences were checked to be in-frame and absent from the training dataset. XBCR-net then predicted the binding probability of each VH-VL pair, ranging from 0 to 1, against the RBD of SARS-CoV-2. We took a predicted probability greater than 0.5 as positive binding and one less than or equal to 0.5 as negative. We then compared our predictions with the experimentally confirmed RBD-binding BCRs from the same scBCR-seq dataset (GSE171703, single-cell BCR sequencing from patients infected with wild-type SARS-CoV-2) as an independent validation. We trained nine models of XBCR-net, based on ACNN, CNN and FCN, each with three different random initialization seeds and data fold selections for training and validation. The prediction results of these nine models were then compared with the binding information of the mAbs cloned from the same scBCR-seq data (Supplementary information, Fig. S5). The performance of our model was then compared with a published transformer-based neural network (Supplementary information, Table S4).
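The 0.5 decision rule on the model output is simply a threshold over the predicted probabilities; the scores below are placeholders, not real model output.

```python
# Sketch of the 0.5 decision rule on predicted binding probabilities.
# Scores are placeholder values for illustration.
import numpy as np

scores = np.array([0.93, 0.50, 0.12, 0.78])   # one score per VH-VL pair
is_binder = scores > 0.5                       # > 0.5 positive; <= 0.5 negative
print(is_binder.tolist())                      # [True, False, False, True]
```

Note that a score of exactly 0.5 is classified as negative under this rule.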

Antibody cloning and purification
The method was described in detail in previous studies. The variable regions of the paired heavy and light chains were synthesized and cloned into vectors expressing the constant region of human IgG1. The vectors were then transfected into 293F cells with polyethylenimine (PEI) at 1 mg/ml. The supernatant was collected on day 5 and purified on protein A columns. The antibody concentration was measured with a NanoDrop spectrophotometer.

Antibody sequence analysis
The antibody sequences were analyzed with IMGT/HighV-QUEST to obtain detailed information on V gene usage and the CDR and FR region sequences. The sequences were then processed with Change-O and VDJtools as described in earlier papers. The V gene and J gene usage was plotted with the VDJtools package, and the statistics were done with Prism 8.0. The CDR-H3 length and sequence dominance were also analyzed and plotted with the VDJtools package. To cluster the CDR-H3s, antibodies with the same CDR-H3 length and more than 80% similarity were allocated to the same cluster using a deterministic clustering approach4. Antibodies in different clusters should therefore have more than 20% difference in CDR-H3. Sequence logos were generated with an online Python-based tool. We define pan-SARS2 antibodies as those that bind to SARS-CoV-2 and its Alpha, Beta, Delta, Gamma and Omicron variants.
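The clustering rule (same CDR-H3 length, more than 80% identity) can be sketched with a greedy single-pass assignment; the published tool may use a different deterministic algorithm, and the sequences below are made up.

```python
# Minimal sketch of the CDR-H3 clustering rule: sequences of equal length with
# > 80% identity share a cluster. Greedy single-pass assignment; the sequences
# are hypothetical examples.
def identity(a: str, b: str) -> float:
    """Fraction of identical positions for two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_cdrh3(seqs):
    clusters = []                               # each cluster keeps its founding sequence
    for s in seqs:
        for c in clusters:
            rep = c[0]
            if len(rep) == len(s) and identity(rep, s) > 0.8:
                c.append(s)
                break
        else:
            clusters.append([s])                # no match: start a new cluster
    return clusters

seqs = ["ARDYYGSGSY", "ARDYYGSGSF", "YSGSGYYDRA", "AKLW"]
print(len(cluster_cdrh3(seqs)))
```

Here the first two sequences differ at a single position (90% identity) and merge, while the remaining two each found their own cluster.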

Criteria for antibody selection from the prediction
Antibodies that received a prediction score higher than 0.5 from all three ACNN-based XBCR-net models were first selected. These antibodies were then further filtered by the best XBCR-net model with a prediction score higher than 0.99. Because of the biased IGHV3-30 usage we observed among the predicted cross-reactive RBD binders, we selected 10 IGHV3-30 antibodies in addition to 15 other antibodies with diverse IGHV usage from the filtered antibody list.
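The two-stage filter can be expressed as two boolean masks; the score matrix and the choice of "best model" column below are placeholder assumptions.

```python
# Sketch of the two-stage selection: a candidate must score > 0.5 under all
# three ACNN-based models, then > 0.99 under the best model.
# Scores and the best-model index are placeholders.
import numpy as np

# rows: candidate antibodies; columns: the three ACNN-based XBCR-net models
scores = np.array([
    [0.97, 0.99, 0.995],
    [0.60, 0.40, 0.70],    # fails the all-models > 0.5 filter
    [0.80, 0.70, 0.60],    # passes stage 1, fails the 0.99 cut-off
])
best_model = 2             # index of the best XBCR-net (assumed)

stage1 = (scores > 0.5).all(axis=1)
stage2 = stage1 & (scores[:, best_model] > 0.99)
print(np.flatnonzero(stage2).tolist())   # [0]
```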

Anti-RBD ELISA
The NUNC 96-well plate was coated with 1 µg/ml RBD of SARS-CoV-2, SARS-CoV-2 Omicron or SARS-CoV overnight at 4 °C. The plates were washed three times with PBST (0.1% Tween) before blocking with 3% BSA for one hour on a shaker.
The mAbs were diluted to 1 µg/ml and co-incubated with the coated antigens for one hour before washing and subsequent incubation with anti-human IgG-AP (Sigma). The colour was developed with pNPP tablets (Sigma), followed by recording of the optical density at 450 nm (OD450) with an ELISA microplate reader. An antibody is defined as ELISA-positive when its OD450 is more than three times that of the negative control, an H7N9-specific human IgG1 antibody (HG1K, Sino Biological).
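The positivity criterion is a simple ratio test against the isotype control; the OD values below are illustrative.

```python
# The ELISA positivity rule from the text: OD450 more than three times the
# negative (isotype) control. All OD values are illustrative.
negative_control_od = 0.08          # HG1K isotype control OD450 (assumed value)
sample_ods = [0.05, 0.30, 0.21]

positive = [od > 3 * negative_control_od for od in sample_ods]
print(positive)   # [False, True, False]
```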

Pseudovirus neutralization assay
The pseudovirus neutralization assay was described in detail in previous studies5.

Statistical Analysis
Statistical analyses were performed as specified in the figure legends. IC50 values were determined by four-parameter non-linear regression using GraphPad Prism 9.0.
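The four-parameter regression Prism performs corresponds to a 4PL dose-response fit; the sketch below reproduces the idea with SciPy on noise-free synthetic data (concentrations, units and parameter values are assumptions).

```python
# Sketch of a four-parameter logistic (4PL) fit for IC50, here done with scipy
# on synthetic, noise-free neutralization data rather than Prism.
import numpy as np
from scipy.optimize import curve_fit

def four_pl(x, bottom, top, ic50, hill):
    """Inhibition-style 4PL: response falls from `top` to `bottom` around ic50."""
    return bottom + (top - bottom) / (1 + (x / ic50) ** hill)

conc = np.array([0.01, 0.1, 1, 10, 100, 1000])   # hypothetical µg/ml dilutions
resp = four_pl(conc, 0.0, 100.0, 5.0, 1.2)       # synthetic % neutralization
params, _ = curve_fit(four_pl, conc, resp, p0=[0, 100, 1, 1])
print(round(params[2], 2))   # fitted IC50 ≈ 5.0
```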