Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

# In silico detection of SARS-CoV-2 specific B-cell epitopes and validation in ELISA for serological diagnosis of COVID-19

## Abstract

Rapid generation of diagnostics is paramount to understand epidemiology and to control the spread of emerging infectious diseases such as COVID-19. Computational methods to predict serodiagnostic epitopes that are specific for the pathogen could help accelerate the development of new diagnostics. A systematic survey of 27 SARS-CoV-2 proteins was conducted to assess whether existing B-cell epitope prediction methods, combined with comprehensive mining of sequence databases and structural data, could predict whether a particular protein would be suitable for serodiagnosis. Nine of the predictions were validated with recombinant SARS-CoV-2 proteins in the ELISA format using plasma and sera from patients with SARS-CoV-2 infection, and a further 11 predictions were compared to the recent literature. Results appeared to be in agreement with 12 of the predictions, in disagreement with 3, while a further 5 were deemed inconclusive. We showed that two of our top five candidates, the N-terminal fragment of the nucleoprotein and the receptor-binding domain of the spike protein, have the highest sensitivity and specificity and signal-to-noise ratio for detecting COVID-19 sera/plasma by ELISA. Mixing the two antigens together for coating ELISA plates led to a sensitivity of 94% (N = 80 samples from persons with RT-PCR confirmed SARS-CoV-2 infection), and a specificity of 97.2% (N = 106 control samples).

## Introduction

The COVID-19 pandemic has highlighted the importance of diagnostic tools that are accurate and cost-effective. So far reports based on linear epitope scanning have been limited in scope to a few selected proteins1. Recent antibody response results are based on single SARS-CoV-2 proteins2,3, or limit testing of the polyprotein to a single domain4. We propose a simple method to predict likely candidates for serological diagnostics based on existing prediction tools for B-cell epitopes that leverages available structural and sequencing data.

## Results and discussion

### Epitope predictions

One of our aims is to propose a visual aid to select suitable targets for the serological diagnosis of a viral pathogen. Results come in the form of a Summary Plot for each viral protein (Suppl Figure S1).

First, linear epitopes were predicted on the sequence of each protein with BepiPred25. At the chosen threshold (80% specificity/30% sensitivity)6, epitopes were detected for 2/3 of the proteins. Distributions per protein and length show that over half the predicted epitopes were located on only 4 proteins: S, nsp3, N and nsp12 (Fig. 1A) and that over 80% of the epitopes were below 15 residues long (Fig. 1B). The raw epitope scores plotted against the sequence illustrates how heavily the threshold affects the prediction (Figure S1 A). This is a known limitation of the predictor which achieves high specificity at a high sensitivity cost. Nonetheless, strong signals were observed for the nucleoprotein, consistent with several recent studies2,7.

Second, we assessed the level of sequence conservation of each SARS-CoV-2 epitope compared to endemic human coronaviruses (HCoVs) strains HKU1, 229E, OC43 and NL63. For this purpose, a comprehensive database of endemic HCoVs protein sequences was built from 3 different sources: NCBI, UniProt and ViPR (Table S2). Prior to performing the comparison, each epitope was trimmed to 15 residues to avoid length bias. Trimmed epitopes were scanned using a sliding window against the endemic HCoV sequences. The average pairwise identity between the epitope and aligned residues was recorded at each position. Conservation was defined as the maximal value obtained after scanning was completed.

Conservation levels ranged between 40 and 100% with large local variations (Figure S1 B). Perhaps unsurprisingly, the most conserved epitopes were located in enzymes that perform essential functions in viral replication, such as the nsp12 RNA-dependent RNA polymerase, and are thus likely to be conserved across species (Fig. 2). For proteins with less conserved epitopes overall, the average masked large local differences. In particular, the nucleoprotein contained epitopes in both the N-terminal RNA-binding and C-terminal dimerization domains, but only the C-terminus includes highly conserved, non-SARS-CoV-2-specific epitopes.

Third, conformational B-cell epitopes were predicted with DiscoTope2 on molecular structures of SARS-CoV-2 proteins8. The unprecedented speed at which experimental SARS-CoV-2 structures have been solved and deposited in the PDB allowed us to predict epitopes on 13 proteins (Table S3)9,10,11,12. Coverage was further extended to all proteins except E envelope protein, ORF9b and ORF14, by including high-quality homology models obtained with the Robetta platform and ab-initio models from the CASP-commons competition (Table S3)13,14. Since the Robetta models were based on PDB templates with high sequence similarity (e.g. SARS-CoV homologs with greater than 90% sequence identity), they are likely to have near atomic level accuracy in the range of less than a few angstroms root mean square deviation (RMSD). For example, the Robetta model for the closed state of the spike trimer is within 2 angstroms RMSD to the experimentally determined structure (6VXX) which was released after the models were generated. The CASP models are de-novo predictions and thus the prediction of conformational epitopes on these models are not directly based on experimental structure determination.

DiscoTope2 scores were plotted on the sequence to show the overlap between conformational and linear epitopes (Figure S1 C). Seven linear epitopes, one on nsp14, and six on spike, had no structural coverage because they matched gaps in the structures, which are indicative of highly flexible segments of the protein. Correlations between mobile segments and antigenicity are well known15. Provided they are not masked by another dominant epitope, we might therefore expect these epitopes to elicit a strong immune response even on the fully folded protein. One of these, 469-STEIYQAGST on the receptor-binding domain of the spike protein (spike RBD), could be of particular interest as conservation analysis indicates that it is SARS-CoV-2 specific.

To predict whether a linear epitope might be dominant, we assigned a ‘dominance score’ to regions where linear and conformational models both predicted an epitope. Six potentially dominant epitopes were thus predicted: nprot_5, nsp3_21 and nsp16_4 are likely to be specific with a conservation level below 0.6 (i.e. specificity above 0.4); nsp2_4, nprot_10 and nprot_11 are likely non-specific (Fig. 3). Interestingly, three of these epitopes are located on the nucleoprotein: the specific epitope is on the N-terminal RNA-binding domain, while the two non-specific epitopes are on the C-terminal dimerization domain. This observation suggests that removal of C-terminal domain might be required to achieve specificity from endemic coronaviruses when assaying antibodies against the nucleoprotein.

Conformational epitopes were also predicted in regions where the linear epitope signal was below the threshold. Four of these structural epitopes were specific with a high dominance score, including nsp3_767s on the papain-like protease and spike_489s on the spike RBD. Lack of linear epitope signal could suggest that an immune response to these epitopes might only be achieved with the native protein or a structural mimic. However, it is also conceivable that the linear prediction tool simply failed to detect these epitopes.

Finally, we analyzed the SARS-CoV-2 genomes in the GISAID repository to locate variants that might affect our specificity predictions. A conservative approach was adopted to extract variants from an alignment of 6,796 unique GISAID sequences (downloaded on 04/20/2020), where each step sought to minimize erroneous variant calls due to sequencing errors. The degree of variability was further quantified by calculating the Shannon entropy at each position in the alignment. Variants were observed in all but 10 of the predicted linear epitopes (Figure S1 D and E). Unsurprisingly, no variants were observed in the nsp12_20 epitope that is 100% conserved in the endemic HCoVs. Shannon entropy identified eleven positions with conserved variants on 9 proteins. Sites with the highest Shannon Entropy values were nsp12 L323P and spike G614D, with a variant frequency of 40% in both cases. The spike variant is located in the 601–640 region that has been previously reported as 78% identical to a dominant B-cell epitope on SARS-CoV6. However, G614 is in a buried beta-sheet in the spike cryo-EM structure and no linear or structural epitope was predicted at that position. Two conserved variants were detected in predicted epitopes. The first is G251V on epitope orf3a_5 249-IDGSSGVVNP observed in 12% of ORF3a sequences; however, orf3a_5 is non-specific and is probably not a good diagnostic antigen candidate. The second is P586S on epitope nsp2_13 584-LQPLEQPTSEAVE observed in 5% of nsp2 sequences. Epitope nsp2_13 is specific; however, we found that the substitution had no effect on linear epitope prediction results or specificity. While it is straightforward to analyze the effect of single mutations, the exercise becomes more complex if multiple positions vary. To predict whether an epitope might be susceptible to sequence variations, we assigned a variant density score that reflects how many positions had at least one observed variant and compared this value to the overall variation frequency (Fig. 4). The resulting plot reveals at a glance the epitopes with a conserved variant discussed above, orf3a_5 and nsp2_13. Figure 4 also highlights three epitopes with infrequent variants (low SE100) but where 80% or more of positions are susceptible to variations, nprot_8_1, orf8_3, and nprot_322s. These epitopes might be too variable to consider for diagnostic use.

### Epitope ranking

The summary plots provide a visual guide to evaluate the suitability of each protein for serological diagnostics, based on the raw predictions results displayed along the sequence. Favorable epitopes combine continuous BepiPred2 scores above the threshold, low conservation vs endemic HCoVs, continuous DiscoTope2 scores above the threshold, low variability and low density of variants. Our rationale for ranking individual epitopes based on derived data such as dominance and variant density scores is reflected in the scoring table (Table S4). Epitopes are classified into one of four broad categories: specific with high dominance score, specific linear, non-specific with high dominance score and non-specific linear. For each protein, we compiled the numbers of epitopes per category to predict whether the protein is likely or not to elicit an immune response and whether that response is likely or not to be specific (Table 1). As illustrated in the table, candidates with no predicted epitopes above the thresholds are marked as ‘no response’. Our top candidates marked as ‘specific’ contain only specific epitopes and at least one of them has a high dominance score. Three proteins match these criteria: the N nucleoprotein RNA-binding domain, the nsp3 papain-like viral protease domain and the nsp16 MTase. Next are the proteins that match the same criteria, but also contain non-specific linear epitopes: the S spike RBD and nsp3 full-length protein. We postulate that the immune response will likely be dominated by the specific epitopes with high dominance score as long as the protein is folded in native conformation. Altogether these 5 proteins would be our top candidates for serological diagnostics.

### Validation

We constructed a panel of 80 persons that had COVID-19 disease, all of whom reported having a positive nasopharyngeal swab for SARS-CoV-2 RNA. 71 of 80 persons in our cohort were convalescent with 21 days or more since symptom onset. Only two persons were likely acutely infected, one who had a blood draw the day of their SARS-CoV-2+ RNA test and another who had a blood draw 2 days after their SARS-CoV-2+ RNA test. The remaining 7 individuals were between 9 and 19 days of symptom resolution. We also gathered 106 serum and plasma “negative control” samples either collected before November 2019 (N = 77) or demonstrated negative on the Abbott Architect SARS-CoV-2 IgG Assay (N = 29).

From a subset of 6 COVID-19+ plasma and 5 “negative control” plasma, we compared our panel of recombinant SARS-CoV-2 proteins, including spike-receptor binding domain (S-RBD), nucleoprotein (full length), nsp9, nsp10/nsp16 complex, nsp15, and nsp7/nsp8 complex (Fig. 5). S-RBD and nucleoprotein full length both gave significantly more ELISA signal with the COVID-19+ patients’ serum/plasma samples than the negative control samples. However, the signal-to-noise ratio of full-length nucleoprotein was only about 2, consistent with the predictions that the C-terminal portion of nucleoprotein has epitopes that are conserved with low pathogenicity coronaviruses. We then expressed the N-terminal, RNA-binding domain of the nucleoprotein (N-Nt), AA residues 47–173 and used this as an antigen for ELISA. The robust signal was retained in the N-Nt ELISA, and the signal-to-noise increased to 4, a twofold increase.

Testing 78 COVID-19+ sera/plasma, we found N-Nt gave a sensitivity of 87% (68 of 78 COVID-19+ positive); using 22 negative controls yielded 100% specificity. (Table S5). Testing S-RBD by ELISA yielded a sensitivity of 92% (72 of 78 COVID-19+ positive) and 100% specificity (0 of 22 negative controls tested positive). A few COVID-19+ persons were disconjugate in their seroreactivity, that is, were positive for one antigen in ELISA, but negative for the other.

Combining the two antigens, S-RBD and N-Nt, to coat ELISA wells detected all of the positive sera in both assays including the disconjugate ones, but also revealed a couple of COVID-19+ plasma that were negative for both antigens alone, that moved into the positive range. Thus, this combined S-RBD and N-Nt ELISA was run on all 80 COVID-19+ sera/plasma and all 106 negative control sera/plasma yielding 94% sensitivity (75 of 80 COVID-19+ persons were positive) and 97.2% specificity (three of the 106 negative control plasma or sera tested were positive).

The five negatives in the combined S-RBD and N-Nt ELISA assay were: one person with SARS-CoV-2 RNA positive test the same day (no records exist as to length of symptoms); a second person with COVID-19 symptom onset 17d prior, a single positive SARS-CoV-2 RNA test, then multiple negative RNA tests on following days after the positive; and three persons that had a positive SARS-CoV-2 RNA test (all adults > 28d after symptom onset). Four out of five of these S-RBD and N-Nt ELISA negatives in the COVID-19+ plasma group were tested with a commercial, FDA-emergency-use-authorized (EUA) serological test. Three were tested with both the Abbott SARS-CoV-2 IgG (Abbott Diagnostics, IL, USA) and the EUROIMMUN Anti-SARS-CoV-2 ELISA IgG (Lüebeck, Germany) IgG tests, carried out as directed by the manufacturers, and were negative for SARS-CoV-2 antibodies by both serological tests. One combined ELISA negative plasma was tested with the InBios International Inc. SCoV-2 Detect IgG ELISA (Seattle, WA, USA) and was also negative for that SARS-CoV-2 antibodies test. The person with the + SARS-CoV-2 RNA test the same day did not get serological testing. Thus four out of five negatives in our N-Nt and S-RBD combined ELISA in the COVID-19+ group were also negative in at least one FDA-EUA serologic test, and the other did not have serological testing.

Of the 75 positives in our combined N-Nt and RBD ELISA test, 74 had FDA-EUA antibody testing of the same specimens. Fifty-four of the combined ELISA positives were tested with both the EUROIMMUN Anti-SARS-CoV-2 ELISA IgG, and the Abbott SARS-CoV-2 IgG tests and three were negative in both of those SARS-CoV-2 antibody tests. In addition to that, one plasma sample from the COVID-19+ tested negative and two were equivocal for only the EUROIMMUN test, but were positive on our combined ELISA test and the Abbott test. Five combined N-Nt and RBD positive plasma from the COVID-19+ group were tested with the EUROIMMUN test only and were all positive. Thirty-nine COVID-19+ , combined N-Nt and RBD positive plasma were tested only with the InBios SCoV-2 Detect IgG ELISA and all were positive. Thus of 74 COVID-19+, combined N-Nt and RBD positive plasma that were tested with a FDA-EUA commercial SARS-CoV-2 test, 71 (95%) were positive in at least one other FDA-EUA commercial test. This suggests the combined Nt-N and S-RBD ELISA test may be slightly more sensitive than the other FDA-EUA serological tests, though larger numbers of samples must be tested to make any firm conclusions.

These results were combined with a recent report from an extensive serological study on SARS-CoV-2 proteins using LISP4 and compared to our in-silico predictions (Table 1). The experimental data appeared to validate our in-silico predictions for Nuc, N-Nt, spike S1, S-RBD, nsp1, nsp7, nsp8, nsp15, ORF3a, ORF6, ORF8 and ORF10. However, the LISP study and others have reported that Nuc is specific2. Our conservation analysis shows that the predicted dominant epitope nprot_10 on the C-terminal domain is matching at least 22 endemic HCoV strains from Seattle isolates, dated between 2015 and 2019, with 67% identity, which might explain some of the non-specific responses to full-length N in our samples (Table S6). The experimental data also contradicted our predictions for M, nsp16 and ORF7a. While M and ORF7a were not predicted to be good candidates, this was not the case for nsp16, which was in our top 5 selection. Although the ELISA was performed on the nsp10/nsp16 complex, the structure shows that the putative dominant epitope on nsp16 is in a loop away from the interface with nsp10 and therefore the lack of response cannot be attributed to masking due to complex formation. The remaining validation on E, nsp9, nsp10 and ORF7b was deemed inconclusive due to weak signal. Full length spike has been recently reported as specific in other studies3 and further validation is needed.

B-cell antigenicity is influenced by the antigen abundance during infection and the way it is displayed to the immune system. Our prediction method does not address the former and is only a crude approximation of the latter. For example, the conservation of discontinuous epitopes was not considered in this study as it involves the comparison of surface areas, which might be disrupted by local insertions. Current in-silico predictions suffer from sensitivity issues and this will only improve with larger training sets, i.e. more experimental structures of antigen–antibody complexes8. Nonetheless, our combined predictions accurately ranked the spike RBD and N-term nucleoprotein as top choice candidates for specific SARS-CoV-2 diagnostics, which was subsequently confirmed by ELISA serological tests. This prediction method can speed the design of serodiagnostics for future pandemics, as demonstrated in this test case for SARS-CoV-2/COVID-19 diagnostics. It can in principle incorporate results from any epitope prediction tool, as long as results are reported as one discrete value per residue. Raw data for estimating the effects of variants can alternatively be sourced directly from public databases, such as NextStrain16, ViPR17 and the CNCB18. However, due to data restrictions at GISAID, sequence alignments cannot be retrieved from those resources; we chose to perform our own variant analysis to be able to inspect protein alignments and retrieve strain information from individual sequences.

The ELISA described is a low-cost method to sensitively and specifically detect antibody to SARS-CoV-2. Both antigens are high expressors, though one is made in E. coli (N-Nt) and one in mammalian cells (S-RBD). We estimate the cost to be only pennies per test. In comparison, the total cost of commercial tests to the patient ranges from $40 to$140 per test, including \$10/test for the reagents alone. We have shipped this test to eight low-to-middle income countries as a cost-effective test to monitor their infection rates and for research purposes. Further validation to address specificity is required prior to use for patient management, especially in low prevalence populations. However, the low frequency of reactivity in 106 individuals suggests that antibodies to more frequent viruses such as influenza, RSV, CMV, HSV and others are not cross-reactive with these antigens.

### Reagent availability

Samples of N-Nt and S-RBD proteins have been deposited in the BEI Resources repository and can be acquired from https://www.beiresources.org/. The amount of N-Nt and S-RBD in one tube (1 mg) would support the coating of 10,000 wells (individual assays, if not in duplicate or triplicate) for ELISA testing.

### Data and code availability

The file containing the raw data and the associated code to produce all the charts in Figure S1 are available on https://github.com/ssgcid/sc2_epitopes .

## Materials and methods

### Linear epitope analysis

Unless otherwise specified, the reference genome throughout this study is from the severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1 (GenBank: MN908947.3). The corresponding protein sequences were selected as follows: polyprotein ORF1ab (UniProt accession: P0DTD1), spike (P0DTC2), ORF3a (P0DTC3), E envelope protein (P0DTC4), M membrane protein (P0DTC5), ORF6 (P0DTC6), ORF7a (P0DTC7), ORF7b (P0DTD8), ORF8 (P0DTC8), N nucleoprotein (P0DTC9), ORF9b (P0DTD2), ORF10 (A0A663DJA2) and ORF14 (P0DTD3). The 15 mature products of polyprotein cleavage, nsp1, nsp2, nsp3 PL-PRO, nsp4, nsp5 3CL-PRO, nsp6, nsp7, nsp8, nsp9, nsp10, nsp12 RdRp, nsp13 helicase, nsp14 ExoN-N7-MTase, nsp15 NendoU and nsp16 MTase, collectively referred to as non-structural proteins (NSPs), were obtained from the chain features annotation of polyprotein ORF1ab on ViralZone19.

The comprehensive database of endemic human coronavirus (HCoV) protein sequences was built from 3 sources: NCBI, UniProtKB and ViPR. The UniProtKB and NCBI Identical Protein Groups databases were searched by the NCBI taxonomy identifiers for human coronaviruses strains HKU1, 229E, OC43 and NL63. Non-human strains sequences from taxonomy subnodes of strain 229E were excluded. Sequence variants manually annotated in the UniProtKB Swiss-Prot entries were extracted with VARSPLIC20. ViPR sequences were obtained by searching the taxonomy tree on the Coronaviridae Gene/Protein search form for strains HKU1, 229E and NL63, and by performing a search by keyword for ["OC43" or " HCoV-OC43"], followed by parsing of the Organism and Strain fields from the fasta headers and grouping by values. Non-human coronavirus sequences and fasta records with no protein sequences were discarded. Although NCBI, UniProtKB and ViPR datasets overlap, we found that each contained unique sequences: One for NCBI, 12 for ViPR and 51 for UniProtKB. These unique sequences could potentially influence results since a single variant on a 7-residue epitope has a 14% impact on the conservation score. The combined dataset of 15,188 sequences was reduced to 1,740 after redundancy elimination and clustering with CD-HIT to remove 100% overlapping fragments21. Polyprotein sequences were identified by blastp similarity searches against the reference and NSPs extracted to avoid false predictions across boundaries, yielding 1,138 unique NSP sequences. These were added back to the non-polyprotein dataset. After removal of 27 sequences with known artificial mutations (originating from the PDB), the final HCoV reference database contained 2,234 protein sequences.

Linear B-cell epitopes were predicted for the 12 non-polyproteins and 15 NSPs of the reference SARS-CoV-2 isolate with BepiPred2.0, using a minimum length of 7 amino acid residues and minimum score of 0.55 (80% specificity/30% sensitivity)5. To reduce the chance of introducing bias towards lower conservation score due to length, long epitopes were trimmed or split to 15 residues prior to scanning for conservation against the endemic HCoV database, by increasing the local BepiPred2 threshold until the stretch of continuous high scores was at most 15 amino acid long. Epitope conservation was estimated by aligning the epitope against each sequence in the endemic HCoV database using a sliding window and calculating the pairwise identity at each position, averaged over the length of the epitope. Epitope conservation was defined as the maximal pairwise identity value obtained after scanning was completed.

### Conformational epitope analysis

Coordinates for experimental structures of SARS-CoV-2 proteins were downloaded from the PDB. In case multiple structures were available, the apo structures with highest sequence coverage and highest resolution were selected. To ensure the highest possible coverage of the SARS-CoV-2 structome, a repository of homology models was built at the onset using Robetta (https://www.ssgcid.org/cttdb/molecularmodel_list/?target__icontains=BewuA).

Where no PDB structure was available, high confidence (overall confidence score > 0.5, local coordinate error < 3Å) Robetta models were used instead. In case no template was available for homology modelling, de-novo models were obtained from CASP-commons (https://predictioncenter.org/caspcommons/). Top scoring models as ranked by ProQ3D EMA22 were selected that also had the highest local score at epitope locations.

Conformational B-cell epitopes were predicted with DiscoTope2 with a significance threshold of -2.5 (80% specificity/40% sensitivity)23. Scores were scaled at zero for plotting. The dominance score for linear epitopes was defined as the DiscoTope2 score averaged over the length of the linear epitope. Epitopes that were discontinuous due to gaps in the structure were excluded. Conformational epitopes with no corresponding linear epitope were inspected for regions with continuous signal. Dominance and conservation were calculated over these regions if they spanned at least 6 residues. Regions longer than 15 residues with diffuse boundaries were excluded.

### SARS-CoV-2 variant analysis

A total of 7,948 high coverage sequences from human isolates were downloaded from GISAID on 04/20/2020, representing 6,797 unique DNA sequences. These were annotated as ‘sequences with < 1% Ns and < 0.05% unique amino acid mutations (not seen in other sequences in the database) and no insertion/deletion unless verified by the submitter’. The unique sequences were aligned to the reference genome with MAFFT experimental version 7.463 using options ‘–auto –addfragments’24,25. The misaligned 3.2 kb fragment EPI_ISL_426413 was identified as Hepatitis B virus isolates JRC-HB01 by MEGABLAST against the NCBI nr database and was removed from the alignment. Further, a total of 40 inserts in the alignment that introduced gaps in the reference sequence were deleted as likely sequencing errors. The 11 largest inserts ranged from 99 to 155 bases, and the next largest were 3 base long. The genome submitter confirmed that those large inserts were likely assembly artefacts that will be corrected. Of the smaller inserts, a three base ‘TTT’ in-frame insertion was observed at nsp6 position 298 in 17 sequences, and is presumably real, although it is in a repetitive region of 8 consecutive Ts and therefore at the limit of accurate PCR amplification of mono-thymidine repeats26. Gaps in aligned sequences were likewise treated as likely sequencing errors and replaced with the unknown base to keep translation in frame with respect to the reference sequence, taking into account the ribosomal frameshift in the nsp12 coding sequence. This process resulted in just 95 remaining premature terminations in over 160 K proteins; 43 of these were found consistently after amino acid 125 of nsp10 and likely represent true truncations. The others were distributed across the sequences of 11 protein families and are likely erroneous. The 27 protein alignments thus obtained, one for each of the SARS-CoV-2 proteins selected for this study, were transformed into a n by m matrix, where n is the length of the reference protein and m the length of the alignment, and processed column-wise. At each position in the alignment, the variant count was obtained by counting unique amino acids that differed from the reference; variability was quantified using the Shannon entropy, calculated as $$SE= -{\sum }_{i=1}^{n}{P}_{i}\times {log}_{2}\left({P}_{i}\right)$$ where n is the number of amino-acid types and Pi is the fraction of amino acid type i at that position27,28. Codes for unknown amino acid (X) and translation termination (*) were ignored in all calculations.

### Epitope scoring table

Supplementary Materials (Table S4). Epitopes with the highest dominance score and highest specificity were ranked at the top: first the epitopes defined by both the linear and conformational predictions; second, the conformational epitopes where the linear prediction was below the threshold; third, the specific dominant epitope with high variant density. This group was followed by epitopes located on flexible segments, ranked by specificity. The remaining linear epitopes were ranked by specificity, average BepiPred2 score and variant density. At the bottom of the list were the least specific epitopes on flexible segments, followed by those with the highest dominance score.

### Protein expression and purification

pET28 vector constructs encoding full length SARS-CoV-2 Nucleoprotein, nsp7, nsp8, nsp9, nsp10, nsp15 and nsp16 were ordered from Genscript. The nucleotide sequences were codon optimized for E.coli expression and in frame with an N terminus 6XHistidine metal affinity tag with a TEV protease cleavage site. Additionally, a pMCSG53 vector construct encoding the N terminus of SARS-CoV-2 Nucleoprotein (47–173), with an N terminal 6X Histidine metal affinity tag and TEV protease cleavage site was gifted from the Center for Structural Genomics of Infectious Diseases (CSGID).

Plasmids were transformed into BL21 (DE3) Rosetta Oxford chemically competent E.coli cells, and selected colonies were re-streaked on LB agar with appropriate antibiotics.

For protein production, starter cultures of PA-0.5G [50 mM Na2HPO4 50 mM KH2PO4 25 mM (NH4)2SO4 2 mM MgSO4 0.2 × trace metals 0.5% glucose 200 µg/mL 18 amino acids] non-inducing media with appropriate antibiotics were grown for 18 h at 25 °C. Antibiotics were added to 2 L bottles of sterile ZYP-5052 [1% tryptone 0.5% yeast extract 50 mM Na2HPO4 50 mM KH2PO4 25 mM (NH4)2SO4 2 mM MgSO4 0.2 × trace metals 0.5% glycerol 0.5% glucose 0.2% lactose] autoinduction media, and the bottles were inoculated with overnight cultures29.

Inoculated bottles were then placed into a LEX bioreactor and cultures grown for 72 h at 20 °C. To harvest, the medium was centrifuged at 6000 RCF for 30 min at 4 °C. Cell paste was frozen and stored at -80 °C prior to purification.

Nsp7, nsp8, nsp9, nsp10, nsp16, nsp15, full-length nucleoprotein and the N-terminal truncation representing the RNA-binding domain (amino acids [AA] 47–173) of nucleoprotein (N-Nt) were all produced in E. coli and complexes assembled by co-lysis of recombinant proteins produced in individual E. coli. For purification, cells were resuspended in lysis buffer [25 mM HEPES (pH 7.0) 500 mM NaCl 5% glycerol 0.5% CHAPS 30 mM imidazole 10 mM MgCl2 1 mM TCEP 1 mM AEBSF 0.025% NaN3] and lysed by sonication. The cell lysate was then cleared by centrifugation at 26,000 g for 45 min at 4 °C. Each recombinant protein was purified from the cell lysate by immobilized metal affinity chromatography. A Ni–NTA column (GE Healthcare, HisTrap FF) was equilibrated in lysis buffer to which the lysate was added. The column was washed with 150 mL of wash buffer supplemented with [25 mM HEPES (pH 7.0) 500 mM NaCl 5% glycerol 30 mM imidazole 1 mM TCEP 0.025% NaN3] and the protein was eluted from the column by the addition of wash buffer supplemented with 350 mM imidazole. In each case, the protein was further purified by size-exclusion chromatography on a GE HiLoad 26/600 Superdex 75 PG or a Superdex 200 PG column previously equilibrated with size-exclusion buffer [25 mM HEPES (pH 7.0) 500 mM NaCl 5% glycerol 2 mM DTT 0.025% NaN3]. Fractions containing the final protein, in a pure state, were pooled and concentrated to 5 mg/mL before being snap frozen in liquid nitrogen and stored at -80 °C.

The prefusion-stabilized trimeric spike (S-RBD) construct was produced at the Institute for Protein Design, Seattle, WA. The expression plasmid was received from Dr. David Veesler, UW Biochemistry, Seattle, WA, with discovery and characterization information as previously described11. The spike protein was expressed using transient transfection in Thermo Expi HEK293 mammalian cells per IPD protocol, harvested after 3 days incubation at 33 °C, and clarified using PDAD (Sigma-Aldrich #409,014). The clarified harvest fluid (CHF) was purified using TaKaRa Talon immobilized metal affinity chromatography (TaKaRaBio 635,504) and buffer exchanged into Tris Buffered Saline at 2–8 °C. The protein was characterized using negative stain Electron Microscopy, SDS-PAGE, and UV–Vis.

### Serological assays (ELISA validation)

Recombinant proteins refer to SARS-CoV-2 nsp7/8, nsp9, nsp10/nsp16, spike-RBD, Nucleoprotein full length, N-Nt, and a mix of S-RBD and N-Nt. The recombinant proteins were diluted to 2 µg/mL into sterile 1 × PBS and then used to coat ELISA plates (Immulon 4HBX Flat Bottom Microtiter Plates 96-wells, 50 plates/case, ThermoScientific, Rochester, NY), using 100 µL of 2 µg/mL into each well, covered, and incubated at 4 °C at least overnight (up to seven days). For the combined ELISA, we coated with both S-RBD at 1 µg/mL and N-Nt at 1 µg/mL in 100 µL PBS in each ELISA well. After incubation at 4 °C at least overnight, the plates were washed with a plate washer 2 × with 300 µL per wash (wash solution for this and all washes: PBS with 0.1% Tween20). Then we added 200 µL of 2% BSA (20 mg/mL) in PBS to each well to quench binding, incubating at least 1 h at room temp with shaking. We then washed with a plate washer 2 × at 300 µL per wash. We heat-treated all plasma or sera at 56 °C for 60 min. Tubes were cooled on ice after heat treatment and spun at 10,000 RPM in a 4 °C microfuge for 10 min, saving the supernatant. The supernatant was then diluted to working concentration (1:100) in PBS-0.1% Tween20-2% BSA and spun again at top speed in a 4 °C microfuge (13,000 RPM or 17,000 × g) for 30 min, and the supernatant was recovered for the next step. We then added 50 µL per well of 1:100 diluted plasma or sera, diluted in PBS-0.1% Tween20-2% BSA and then incubated for 90 min at room temp, while shaking. We then washed the plates with a plate washer 3 × 300 µL per wash. We then added 100 µL per well of anti-IgG-HRP (Rockland anti-human-IgG(H + L) goat antibody horseradish peroxidase (HRP) conjugated, Cat# 609–1302) (diluted in same buffer as the sera/plasma), and incubated 1 h at room temp while shaking. After that incubation, we washed with a plate washer 3 × at 300 µL per wash. We then added 100 µL of developing solution (SigmaFAST OPD tablets, Sigma, St. Louis, MO, prepared as per manufacturer), allowed 30 min in the dark at room temperature, and added 25 µL stop solution (3 N HCl). The plates were then covered and read on plate reader at OD 492 nM. Positives were defined as those with OD492 that are greater than the (mean + 3 times standard deviation of mean) of the negative controls. The ELISA was run each time with 10 to 20 control plasma/sera for this calculation.

The samples from those persons that tested positive for SARS-CoV-2 were derived from plasma or sera from adults (≥ 18 yo) of both genders, provided without personal identifiers, from Seattle Children’s SARS2 Recovered Cohort (N = 16) and a convalescent cohort screened for eligibility for plasma donation (N = 57). In addition, seven specimens were residual clinical samples obtained from inpatients at the University of Washington. Forty-two negative control plasma and serum samples were from banked samples collected from healthy male and female adult (> 21 yo) volunteers who had previously participated in PK studies conducted earlier than November 2019 by Dr. Nina Isoherranen’s UW group. A second group of negative control specimens included thirty sera samples from adults participating in University of Washington Virology Research Clinic protocols studying Zoster vaccination earlier than November 2019. A third group of negative controls were 29 serum specimens from the UW Department of Laboratory Medicine that tested negative for COVID-19 in the Abbott Architect SARS-CoV-2 IgG Assay30. All negative control samples were obtained earlier than November 2019. All serum and plasma samples were collected under either the University of Washington or the Seattle Children's IRB approval.

### Approval for human experiments statement

All methods were carried out in accordance with relevant guidelines and regulations. All human serum and plasma samples were obtained from donors over 18 yo with informed consent under protocols approved by the University of Washington or Seattle Children’s IRB.

## References

1. 1.

Dahlke, C. et al. Distinct early IgA profile may determine severity of COVID-19 symptoms: an immunological case series. medRxiv https://doi.org/10.1101/2020.04.14.20059733 (2020).

2. 2.

Guo, L. et al. Profiling early humoral response to diagnose novel coronavirus disease (COVID-19). Clin. Infect. Dis. https://doi.org/10.1093/cid/ciaa310 (2020).

3. 3.

Amanat, F. et al. A serological assay to detect SARS-CoV-2 seroconversion in humans. Nat. Med. https://doi.org/10.1038/s41591-020-0913-5 (2020).

4. 4.

Hachim, A. et al. Beyond the Spike: identification of viral targets of the antibody response to SARS-CoV-2 in COVID-19 patients. medRxiv https://doi.org/10.1101/2020.04.30.20085670 (2020).

5. 5.

Jespersen, M. C., Peters, B., Nielsen, M. & Marcatili, P. BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res. 45, W24–W29 (2017).

6. 6.

Grifoni, A. et al. A sequence homology and bioinformatic approach can predict candidate targets for immune responses to SARS-CoV-2. Cell Host Microbe https://doi.org/10.1016/j.chom.2020.03.002 (2020).

7. 7.

Wang, H. et al. SARS-CoV-2 proteome microarray for mapping COVID-19 antibody interactions at amino acid resolution. bioRxiv https://doi.org/10.1101/2020.03.26.994756 (2020).

8. 8.

Jespersen, M. C., Mahajan, S., Peters, B., Nielsen, M. & Marcatili, P. Antibody specific B-cell epitope predictions: leveraging information from antibody-antigen protein complexes. Front. Immunol. https://doi.org/10.3389/fimmu.2019.00298 (2019).

9. 9.

Jia, Z. et al. Delicate structural coordination of the Severe Acute Respiratory Syndrome coronavirus Nsp13 upon ATP hydrolysis. Nucleic Acids Res. 47, 6538–6550 (2019).

10. 10.

Kim, Y. et al. Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2. Protein Sci. https://doi.org/10.1002/pro.3873 (2020).

11. 11.

Walls, A. C. et al. Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell https://doi.org/10.1016/j.cell.2020.02.058 (2020).

12. 12.

Yin, W. et al. Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir. Science https://doi.org/10.1126/science.abc1560 (2020).

13. 13.

Park, H. et al. High-accuracy refinement using Rosetta in CASP13. Proteins 87, 1276–1282 (2019).

14. 14.

Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl. Acad. Sci. U.S.A. 117, 1496–1503 (2020).

15. 15.

Westhof, E. et al. Correlation between segmental mobility and the location of antigenic determinants in proteins. Nature 311, 123–126 (1984).

16. 16.

Hadfield, J. et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34, 4121–4123 (2018).

17. 17.

Pickett, B. E. et al. ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res. 40, D593-598 (2012).

18. 18.

Zhao, W.-M. et al. The 2019 novel coronavirus resource. Yi Chuan 42, 212–221 (2020).

19. 19.

Hulo, C. et al. ViralZone: a knowledge resource to understand virus diversity. Nucleic Acids Res. 39, D576-582 (2011).

20. 20.

Kersey, P., Hermjakob, H. & Apweiler, R. VARSPLIC: alternatively-spliced protein sequences derived from SWISS-PROT and TrEMBL. Bioinformatics 16, 1048–1049 (2000).

21. 21.

Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).

22. 22.

Uziela, K., Menéndez Hurtado, D., Shu, N., Wallner, B. & Elofsson, A. ProQ3D: improved model quality assessments using deep learning. Bioinformatics 33, 1578–1580 (2017).

23. 23.

Kringelum, J. V., Lundegaard, C., Lund, O. & Nielsen, M. Reliable B cell epitope predictions: impacts of method development and improved benchmarking. PLoS Comput. Biol. 8, e1002829 (2012).

24. 24.

Katoh, K. MAFFT—a multiple sequence alignment program. https://mafft.cbrc.jp/alignment/software/closelyrelatedviralgenomes.html (2020).

25. 25.

Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).

26. 26.

Clarke, L. A., Rebelo, C. S., Gonçalves, J., Boavida, M. G. & Jordan, P. PCR amplification introduces errors into mononucleotide and dinucleotide repeat sequences. Mol. Pathol. 54, 351–353 (2001).

27. 27.

Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948).

28. 28.

Strait, B. J. & Dewey, T. G. The Shannon information entropy of protein sequences. Biophys. J. 71, 148–155 (1996).

29. 29.

Studier, F. W. Protein production by auto-induction in high density shaking cultures. Protein Exp. Purif. 41, 207–234 (2005).

30. 30.

Bryan, A. et al. Performance characteristics of the Abbott Architect SARS-CoV-2 IgG Assay and Seroprevalence in Boise, Idaho. J. Clin. Microbiol. https://doi.org/10.1128/JCM.00941-20 (2020).

## Acknowledgements

We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID’s EpiFlu Database on which this research is based. The list is detailed in Suppl. Table S7. We thank Christopher L. McClurkan and Victoria L. Campbell for assistance with specimen processing, Winnie Yeung, Chihiro Morishima, and Stacey Selke for their assistance in accessing commercial FDA-EUA serological data, Ryan Choi for help with ELISA graphs, all of SSGCID for excellent support, as well as Elisabeth Gasteiger and Richard Scheuermann for helpful discussions. We thank the University of Washington, Department of Laboratory Medicine, Northwest Biospecimens Repository for providing serum and plasma samples. We also thank The Audacious Project at the Institute for Protein Design for supporting the Robetta infrastructure and the CASP Commons competitors and organizers for making the de-novo models openly accessible.

## Funding

This project has been funded in part with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under NIH contract HHSN272201700059C, and NIH Grant R01AG064800 (Koelle), NIH/NIAID - K08 AI135072 & Burroughs Wellcome Fund CAMS 1017213 (Harrington), NIH/NIAID 1 U01 AI151698-01 (Van Voorhis, Gale, Rabinowitz, and Wasserheit), the Bill & Melinda Gates Foundation OPP1156262 (King), the NIMGS/NIH R01GM120553 (Veesler), a Pew Biomedical Scholars Award (Veesler) and Burroughs Wellcome Investigators in the Pathogenesis of Infectious Diseases awards (Veesler) and Fast Grants (Veesler).

## Author information

Authors

### Contributions

I.Q.P. conceived, performed and interpreted the bioinformatics analysis, and wrote the original draft. S.S. assisted in the bioinformatics analysis. D.K. modelled the template-based structures. L.C., M.M., and D.P. expressed the RBD protein and provided quality control. I.A. modelled the TrRosetta structures and provided support for ab-initio model analysis. L.K.B. provided laboratory support and edited the manuscript. R.S. produced and quality checked antigen for the investigations and edited the manuscript. J.C. and L.T. helped produce antigen for the investigations and edited the manuscript. D.V. designed the construct to produce the RBD. N.P.K. planned the production of antigens and edited the manuscript. W.E.H., D.M.K., A.W., J.B., N.I., A.L.G., K.R.J. and H.C. provided resources such as patient samples and clinical information and edited the manuscript. B.S. assisted in structure analysis. L.S. provided laboratory support. P.J.M. supervised the study. W.C.V.V. planned and carried out the ELISA studies, drafted the manuscript along with I.Q.P., and edited the manuscript. All authors reviewed the manuscript.

### Corresponding author

Correspondence to Wesley C. Van Voorhis.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

### Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Reprints and Permissions

Phan, I.Q., Subramanian, S., Kim, D. et al. In silico detection of SARS-CoV-2 specific B-cell epitopes and validation in ELISA for serological diagnosis of COVID-19. Sci Rep 11, 4290 (2021). https://doi.org/10.1038/s41598-021-83730-y

• Accepted:

• Published: