Introduction

The COVID-19 pandemic has highlighted the importance of diagnostic tools that are accurate and cost-effective. So far reports based on linear epitope scanning have been limited in scope to a few selected proteins1. Recent antibody response results are based on single SARS-CoV-2 proteins2,3, or limit testing of the polyprotein to a single domain4. We propose a simple method to predict likely candidates for serological diagnostics based on existing prediction tools for B-cell epitopes that leverages available structural and sequencing data.

Results and discussion

Epitope predictions

One of our aims is to propose a visual aid to select suitable targets for the serological diagnosis of a viral pathogen. Results come in the form of a Summary Plot for each viral protein (Suppl Figure S1).

First, linear epitopes were predicted on the sequence of each protein with BepiPred25. At the chosen threshold (80% specificity/30% sensitivity)6, epitopes were detected for 2/3 of the proteins. Distributions per protein and length show that over half the predicted epitopes were located on only 4 proteins: S, nsp3, N and nsp12 (Fig. 1A) and that over 80% of the epitopes were below 15 residues long (Fig. 1B). The raw epitope scores plotted against the sequence illustrates how heavily the threshold affects the prediction (Figure S1 A). This is a known limitation of the predictor which achieves high specificity at a high sensitivity cost. Nonetheless, strong signals were observed for the nucleoprotein, consistent with several recent studies2,7.

Figure 1
figure 1

Linear epitope BepiPred2 predictions: (A) summary count per protein with at least one predicted epitope; (B) length distribution.

Second, we assessed the level of sequence conservation of each SARS-CoV-2 epitope compared to endemic human coronaviruses (HCoVs) strains HKU1, 229E, OC43 and NL63. For this purpose, a comprehensive database of endemic HCoVs protein sequences was built from 3 different sources: NCBI, UniProt and ViPR (Table S2). Prior to performing the comparison, each epitope was trimmed to 15 residues to avoid length bias. Trimmed epitopes were scanned using a sliding window against the endemic HCoV sequences. The average pairwise identity between the epitope and aligned residues was recorded at each position. Conservation was defined as the maximal value obtained after scanning was completed.

Conservation levels ranged between 40 and 100% with large local variations (Figure S1 B). Perhaps unsurprisingly, the most conserved epitopes were located in enzymes that perform essential functions in viral replication, such as the nsp12 RNA-dependent RNA polymerase, and are thus likely to be conserved across species (Fig. 2). For proteins with less conserved epitopes overall, the average masked large local differences. In particular, the nucleoprotein contained epitopes in both the N-terminal RNA-binding and C-terminal dimerization domains, but only the C-terminus includes highly conserved, non-SARS-CoV-2-specific epitopes.

Figure 2
figure 2

Distribution of conservation levels (1 = 100% identity) of SARS-CoV-2 linear epitopes compared to our comprehensive database of protein sequences from HCoV endemic strains. Epitopes located on essential enzymes such as the RdRp and helicase are the most conserved.

Third, conformational B-cell epitopes were predicted with DiscoTope2 on molecular structures of SARS-CoV-2 proteins8. The unprecedented speed at which experimental SARS-CoV-2 structures have been solved and deposited in the PDB allowed us to predict epitopes on 13 proteins (Table S3)9,10,11,12. Coverage was further extended to all proteins except E envelope protein, ORF9b and ORF14, by including high-quality homology models obtained with the Robetta platform and ab-initio models from the CASP-commons competition (Table S3)13,14. Since the Robetta models were based on PDB templates with high sequence similarity (e.g. SARS-CoV homologs with greater than 90% sequence identity), they are likely to have near atomic level accuracy in the range of less than a few angstroms root mean square deviation (RMSD). For example, the Robetta model for the closed state of the spike trimer is within 2 angstroms RMSD to the experimentally determined structure (6VXX) which was released after the models were generated. The CASP models are de-novo predictions and thus the prediction of conformational epitopes on these models are not directly based on experimental structure determination.

DiscoTope2 scores were plotted on the sequence to show the overlap between conformational and linear epitopes (Figure S1 C). Seven linear epitopes, one on nsp14, and six on spike, had no structural coverage because they matched gaps in the structures, which are indicative of highly flexible segments of the protein. Correlations between mobile segments and antigenicity are well known15. Provided they are not masked by another dominant epitope, we might therefore expect these epitopes to elicit a strong immune response even on the fully folded protein. One of these, 469-STEIYQAGST on the receptor-binding domain of the spike protein (spike RBD), could be of particular interest as conservation analysis indicates that it is SARS-CoV-2 specific.

To predict whether a linear epitope might be dominant, we assigned a ‘dominance score’ to regions where linear and conformational models both predicted an epitope. Six potentially dominant epitopes were thus predicted: nprot_5, nsp3_21 and nsp16_4 are likely to be specific with a conservation level below 0.6 (i.e. specificity above 0.4); nsp2_4, nprot_10 and nprot_11 are likely non-specific (Fig. 3). Interestingly, three of these epitopes are located on the nucleoprotein: the specific epitope is on the N-terminal RNA-binding domain, while the two non-specific epitopes are on the C-terminal dimerization domain. This observation suggests that removal of C-terminal domain might be required to achieve specificity from endemic coronaviruses when assaying antibodies against the nucleoprotein.

Figure 3
figure 3

Dominance score vs specificity of predicted linear B-cell epitopes (blue) and structural epitopes (orange). Specificity is calculated as 1—conservation. The dominance threshold (DiscoTope2 minimal score scaled to 0) is shown as a green line. Potentially dominant epitopes are labelled. Scores below − 3 have been omitted for clarity.

Conformational epitopes were also predicted in regions where the linear epitope signal was below the threshold. Four of these structural epitopes were specific with a high dominance score, including nsp3_767s on the papain-like protease and spike_489s on the spike RBD. Lack of linear epitope signal could suggest that an immune response to these epitopes might only be achieved with the native protein or a structural mimic. However, it is also conceivable that the linear prediction tool simply failed to detect these epitopes.

Finally, we analyzed the SARS-CoV-2 genomes in the GISAID repository to locate variants that might affect our specificity predictions. A conservative approach was adopted to extract variants from an alignment of 6,796 unique GISAID sequences (downloaded on 04/20/2020), where each step sought to minimize erroneous variant calls due to sequencing errors. The degree of variability was further quantified by calculating the Shannon entropy at each position in the alignment. Variants were observed in all but 10 of the predicted linear epitopes (Figure S1 D and E). Unsurprisingly, no variants were observed in the nsp12_20 epitope that is 100% conserved in the endemic HCoVs. Shannon entropy identified eleven positions with conserved variants on 9 proteins. Sites with the highest Shannon Entropy values were nsp12 L323P and spike G614D, with a variant frequency of 40% in both cases. The spike variant is located in the 601–640 region that has been previously reported as 78% identical to a dominant B-cell epitope on SARS-CoV6. However, G614 is in a buried beta-sheet in the spike cryo-EM structure and no linear or structural epitope was predicted at that position. Two conserved variants were detected in predicted epitopes. The first is G251V on epitope orf3a_5 249-IDGSSGVVNP observed in 12% of ORF3a sequences; however, orf3a_5 is non-specific and is probably not a good diagnostic antigen candidate. The second is P586S on epitope nsp2_13 584-LQPLEQPTSEAVE observed in 5% of nsp2 sequences. Epitope nsp2_13 is specific; however, we found that the substitution had no effect on linear epitope prediction results or specificity. While it is straightforward to analyze the effect of single mutations, the exercise becomes more complex if multiple positions vary. To predict whether an epitope might be susceptible to sequence variations, we assigned a variant density score that reflects how many positions had at least one observed variant and compared this value to the overall variation frequency (Fig. 4). The resulting plot reveals at a glance the epitopes with a conserved variant discussed above, orf3a_5 and nsp2_13. Figure 4 also highlights three epitopes with infrequent variants (low SE100) but where 80% or more of positions are susceptible to variations, nprot_8_1, orf8_3, and nprot_322s. These epitopes might be too variable to consider for diagnostic use.

Figure 4
figure 4

Amino acid variation in predicted epitopes, as average variability (SE100 = 100*Shannon entropy over epitope length) vs variant density. Epitopes with variants detected in over 80% of residues or variability above 5 are labelled. The high SE100 of orf3a_5 and nsp2_13 are due to single conserved variants.

Epitope ranking

The summary plots provide a visual guide to evaluate the suitability of each protein for serological diagnostics, based on the raw predictions results displayed along the sequence. Favorable epitopes combine continuous BepiPred2 scores above the threshold, low conservation vs endemic HCoVs, continuous DiscoTope2 scores above the threshold, low variability and low density of variants. Our rationale for ranking individual epitopes based on derived data such as dominance and variant density scores is reflected in the scoring table (Table S4). Epitopes are classified into one of four broad categories: specific with high dominance score, specific linear, non-specific with high dominance score and non-specific linear. For each protein, we compiled the numbers of epitopes per category to predict whether the protein is likely or not to elicit an immune response and whether that response is likely or not to be specific (Table 1). As illustrated in the table, candidates with no predicted epitopes above the thresholds are marked as ‘no response’. Our top candidates marked as ‘specific’ contain only specific epitopes and at least one of them has a high dominance score. Three proteins match these criteria: the N nucleoprotein RNA-binding domain, the nsp3 papain-like viral protease domain and the nsp16 MTase. Next are the proteins that match the same criteria, but also contain non-specific linear epitopes: the S spike RBD and nsp3 full-length protein. We postulate that the immune response will likely be dominated by the specific epitopes with high dominance score as long as the protein is folded in native conformation. Altogether these 5 proteins would be our top candidates for serological diagnostics.

Table 1 Summary predictions of SARS-CoV-2 diagnostic candidates, with the numbers of predicted epitopes in each category. Flexible segments are classified as having a high dominance score. Susceptibility to variants indicates the presence of epitopes with conserved mutations or a variant density above 80%. Structure reliability is set to 3 for experimental structures (very reliable), 2 for homology models and 1 for de-novo models (unreliable). Top five candidates are highlighted with a black box.

Validation

We constructed a panel of 80 persons that had COVID-19 disease, all of whom reported having a positive nasopharyngeal swab for SARS-CoV-2 RNA. 71 of 80 persons in our cohort were convalescent with 21 days or more since symptom onset. Only two persons were likely acutely infected, one who had a blood draw the day of their SARS-CoV-2+ RNA test and another who had a blood draw 2 days after their SARS-CoV-2+ RNA test. The remaining 7 individuals were between 9 and 19 days of symptom resolution. We also gathered 106 serum and plasma “negative control” samples either collected before November 2019 (N = 77) or demonstrated negative on the Abbott Architect SARS-CoV-2 IgG Assay (N = 29).

From a subset of 6 COVID-19+ plasma and 5 “negative control” plasma, we compared our panel of recombinant SARS-CoV-2 proteins, including spike-receptor binding domain (S-RBD), nucleoprotein (full length), nsp9, nsp10/nsp16 complex, nsp15, and nsp7/nsp8 complex (Fig. 5). S-RBD and nucleoprotein full length both gave significantly more ELISA signal with the COVID-19+ patients’ serum/plasma samples than the negative control samples. However, the signal-to-noise ratio of full-length nucleoprotein was only about 2, consistent with the predictions that the C-terminal portion of nucleoprotein has epitopes that are conserved with low pathogenicity coronaviruses. We then expressed the N-terminal, RNA-binding domain of the nucleoprotein (N-Nt), AA residues 47–173 and used this as an antigen for ELISA. The robust signal was retained in the N-Nt ELISA, and the signal-to-noise increased to 4, a twofold increase.

Figure 5
figure 5

ELISA results (mean optical density at 492 nm [OD492], error bars are standard deviations of the mean) comparing plasma from 6 COVID-19 persons and 5 controls taken before November 2019 using anti-IgG conjugated to horseradish peroxidase (HRP). Performance of spike-receptor binding domain (S-RBD) and nucleoprotein in selective reaction with COVID-19 plasma IgG is far superior to nsp9, complexes of nsp7/8, complexes of nsp10/16, or nsp15. Note: the N-terminal portion of nucleoprotein (N-Nt) retains the superior signal of full-length nucleoprotein (Nuc) but improves the signal-to-noise ratio of the assay by nearly twofold, to 4.2 from 2.3, respectively. This implies that epitopes on the C-terminal portion of Nuc are non-specific for SARS-CoV-2, as predicted.

Testing 78 COVID-19+ sera/plasma, we found N-Nt gave a sensitivity of 87% (68 of 78 COVID-19+ positive); using 22 negative controls yielded 100% specificity. (Table S5). Testing S-RBD by ELISA yielded a sensitivity of 92% (72 of 78 COVID-19+ positive) and 100% specificity (0 of 22 negative controls tested positive). A few COVID-19+ persons were disconjugate in their seroreactivity, that is, were positive for one antigen in ELISA, but negative for the other.

Combining the two antigens, S-RBD and N-Nt, to coat ELISA wells detected all of the positive sera in both assays including the disconjugate ones, but also revealed a couple of COVID-19+ plasma that were negative for both antigens alone, that moved into the positive range. Thus, this combined S-RBD and N-Nt ELISA was run on all 80 COVID-19+ sera/plasma and all 106 negative control sera/plasma yielding 94% sensitivity (75 of 80 COVID-19+ persons were positive) and 97.2% specificity (three of the 106 negative control plasma or sera tested were positive).

The five negatives in the combined S-RBD and N-Nt ELISA assay were: one person with SARS-CoV-2 RNA positive test the same day (no records exist as to length of symptoms); a second person with COVID-19 symptom onset 17d prior, a single positive SARS-CoV-2 RNA test, then multiple negative RNA tests on following days after the positive; and three persons that had a positive SARS-CoV-2 RNA test (all adults > 28d after symptom onset). Four out of five of these S-RBD and N-Nt ELISA negatives in the COVID-19+ plasma group were tested with a commercial, FDA-emergency-use-authorized (EUA) serological test. Three were tested with both the Abbott SARS-CoV-2 IgG (Abbott Diagnostics, IL, USA) and the EUROIMMUN Anti-SARS-CoV-2 ELISA IgG (Lüebeck, Germany) IgG tests, carried out as directed by the manufacturers, and were negative for SARS-CoV-2 antibodies by both serological tests. One combined ELISA negative plasma was tested with the InBios International Inc. SCoV-2 Detect IgG ELISA (Seattle, WA, USA) and was also negative for that SARS-CoV-2 antibodies test. The person with the + SARS-CoV-2 RNA test the same day did not get serological testing. Thus four out of five negatives in our N-Nt and S-RBD combined ELISA in the COVID-19+ group were also negative in at least one FDA-EUA serologic test, and the other did not have serological testing.

Of the 75 positives in our combined N-Nt and RBD ELISA test, 74 had FDA-EUA antibody testing of the same specimens. Fifty-four of the combined ELISA positives were tested with both the EUROIMMUN Anti-SARS-CoV-2 ELISA IgG, and the Abbott SARS-CoV-2 IgG tests and three were negative in both of those SARS-CoV-2 antibody tests. In addition to that, one plasma sample from the COVID-19+ tested negative and two were equivocal for only the EUROIMMUN test, but were positive on our combined ELISA test and the Abbott test. Five combined N-Nt and RBD positive plasma from the COVID-19+ group were tested with the EUROIMMUN test only and were all positive. Thirty-nine COVID-19+ , combined N-Nt and RBD positive plasma were tested only with the InBios SCoV-2 Detect IgG ELISA and all were positive. Thus of 74 COVID-19+, combined N-Nt and RBD positive plasma that were tested with a FDA-EUA commercial SARS-CoV-2 test, 71 (95%) were positive in at least one other FDA-EUA commercial test. This suggests the combined Nt-N and S-RBD ELISA test may be slightly more sensitive than the other FDA-EUA serological tests, though larger numbers of samples must be tested to make any firm conclusions.

These results were combined with a recent report from an extensive serological study on SARS-CoV-2 proteins using LISP4 and compared to our in-silico predictions (Table 1). The experimental data appeared to validate our in-silico predictions for Nuc, N-Nt, spike S1, S-RBD, nsp1, nsp7, nsp8, nsp15, ORF3a, ORF6, ORF8 and ORF10. However, the LISP study and others have reported that Nuc is specific2. Our conservation analysis shows that the predicted dominant epitope nprot_10 on the C-terminal domain is matching at least 22 endemic HCoV strains from Seattle isolates, dated between 2015 and 2019, with 67% identity, which might explain some of the non-specific responses to full-length N in our samples (Table S6). The experimental data also contradicted our predictions for M, nsp16 and ORF7a. While M and ORF7a were not predicted to be good candidates, this was not the case for nsp16, which was in our top 5 selection. Although the ELISA was performed on the nsp10/nsp16 complex, the structure shows that the putative dominant epitope on nsp16 is in a loop away from the interface with nsp10 and therefore the lack of response cannot be attributed to masking due to complex formation. The remaining validation on E, nsp9, nsp10 and ORF7b was deemed inconclusive due to weak signal. Full length spike has been recently reported as specific in other studies3 and further validation is needed.

B-cell antigenicity is influenced by the antigen abundance during infection and the way it is displayed to the immune system. Our prediction method does not address the former and is only a crude approximation of the latter. For example, the conservation of discontinuous epitopes was not considered in this study as it involves the comparison of surface areas, which might be disrupted by local insertions. Current in-silico predictions suffer from sensitivity issues and this will only improve with larger training sets, i.e. more experimental structures of antigen–antibody complexes8. Nonetheless, our combined predictions accurately ranked the spike RBD and N-term nucleoprotein as top choice candidates for specific SARS-CoV-2 diagnostics, which was subsequently confirmed by ELISA serological tests. This prediction method can speed the design of serodiagnostics for future pandemics, as demonstrated in this test case for SARS-CoV-2/COVID-19 diagnostics. It can in principle incorporate results from any epitope prediction tool, as long as results are reported as one discrete value per residue. Raw data for estimating the effects of variants can alternatively be sourced directly from public databases, such as NextStrain16, ViPR17 and the CNCB18. However, due to data restrictions at GISAID, sequence alignments cannot be retrieved from those resources; we chose to perform our own variant analysis to be able to inspect protein alignments and retrieve strain information from individual sequences.

The ELISA described is a low-cost method to sensitively and specifically detect antibody to SARS-CoV-2. Both antigens are high expressors, though one is made in E. coli (N-Nt) and one in mammalian cells (S-RBD). We estimate the cost to be only pennies per test. In comparison, the total cost of commercial tests to the patient ranges from $40 to $140 per test, including $10/test for the reagents alone. We have shipped this test to eight low-to-middle income countries as a cost-effective test to monitor their infection rates and for research purposes. Further validation to address specificity is required prior to use for patient management, especially in low prevalence populations. However, the low frequency of reactivity in 106 individuals suggests that antibodies to more frequent viruses such as influenza, RSV, CMV, HSV and others are not cross-reactive with these antigens.

Reagent availability

Samples of N-Nt and S-RBD proteins have been deposited in the BEI Resources repository and can be acquired from https://www.beiresources.org/. The amount of N-Nt and S-RBD in one tube (1 mg) would support the coating of 10,000 wells (individual assays, if not in duplicate or triplicate) for ELISA testing.

Data and code availability

The file containing the raw data and the associated code to produce all the charts in Figure S1 are available on https://github.com/ssgcid/sc2_epitopes .

Materials and methods

Linear epitope analysis

Unless otherwise specified, the reference genome throughout this study is from the severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1 (GenBank: MN908947.3). The corresponding protein sequences were selected as follows: polyprotein ORF1ab (UniProt accession: P0DTD1), spike (P0DTC2), ORF3a (P0DTC3), E envelope protein (P0DTC4), M membrane protein (P0DTC5), ORF6 (P0DTC6), ORF7a (P0DTC7), ORF7b (P0DTD8), ORF8 (P0DTC8), N nucleoprotein (P0DTC9), ORF9b (P0DTD2), ORF10 (A0A663DJA2) and ORF14 (P0DTD3). The 15 mature products of polyprotein cleavage, nsp1, nsp2, nsp3 PL-PRO, nsp4, nsp5 3CL-PRO, nsp6, nsp7, nsp8, nsp9, nsp10, nsp12 RdRp, nsp13 helicase, nsp14 ExoN-N7-MTase, nsp15 NendoU and nsp16 MTase, collectively referred to as non-structural proteins (NSPs), were obtained from the chain features annotation of polyprotein ORF1ab on ViralZone19.

The comprehensive database of endemic human coronavirus (HCoV) protein sequences was built from 3 sources: NCBI, UniProtKB and ViPR. The UniProtKB and NCBI Identical Protein Groups databases were searched by the NCBI taxonomy identifiers for human coronaviruses strains HKU1, 229E, OC43 and NL63. Non-human strains sequences from taxonomy subnodes of strain 229E were excluded. Sequence variants manually annotated in the UniProtKB Swiss-Prot entries were extracted with VARSPLIC20. ViPR sequences were obtained by searching the taxonomy tree on the Coronaviridae Gene/Protein search form for strains HKU1, 229E and NL63, and by performing a search by keyword for ["OC43" or " HCoV-OC43"], followed by parsing of the Organism and Strain fields from the fasta headers and grouping by values. Non-human coronavirus sequences and fasta records with no protein sequences were discarded. Although NCBI, UniProtKB and ViPR datasets overlap, we found that each contained unique sequences: One for NCBI, 12 for ViPR and 51 for UniProtKB. These unique sequences could potentially influence results since a single variant on a 7-residue epitope has a 14% impact on the conservation score. The combined dataset of 15,188 sequences was reduced to 1,740 after redundancy elimination and clustering with CD-HIT to remove 100% overlapping fragments21. Polyprotein sequences were identified by blastp similarity searches against the reference and NSPs extracted to avoid false predictions across boundaries, yielding 1,138 unique NSP sequences. These were added back to the non-polyprotein dataset. After removal of 27 sequences with known artificial mutations (originating from the PDB), the final HCoV reference database contained 2,234 protein sequences.

Linear B-cell epitopes were predicted for the 12 non-polyproteins and 15 NSPs of the reference SARS-CoV-2 isolate with BepiPred2.0, using a minimum length of 7 amino acid residues and minimum score of 0.55 (80% specificity/30% sensitivity)5. To reduce the chance of introducing bias towards lower conservation score due to length, long epitopes were trimmed or split to 15 residues prior to scanning for conservation against the endemic HCoV database, by increasing the local BepiPred2 threshold until the stretch of continuous high scores was at most 15 amino acid long. Epitope conservation was estimated by aligning the epitope against each sequence in the endemic HCoV database using a sliding window and calculating the pairwise identity at each position, averaged over the length of the epitope. Epitope conservation was defined as the maximal pairwise identity value obtained after scanning was completed.

Conformational epitope analysis

Coordinates for experimental structures of SARS-CoV-2 proteins were downloaded from the PDB. In case multiple structures were available, the apo structures with highest sequence coverage and highest resolution were selected. To ensure the highest possible coverage of the SARS-CoV-2 structome, a repository of homology models was built at the onset using Robetta (https://www.ssgcid.org/cttdb/molecularmodel_list/?target__icontains=BewuA).

Where no PDB structure was available, high confidence (overall confidence score > 0.5, local coordinate error < 3Å) Robetta models were used instead. In case no template was available for homology modelling, de-novo models were obtained from CASP-commons (https://predictioncenter.org/caspcommons/). Top scoring models as ranked by ProQ3D EMA22 were selected that also had the highest local score at epitope locations.

Conformational B-cell epitopes were predicted with DiscoTope2 with a significance threshold of -2.5 (80% specificity/40% sensitivity)23. Scores were scaled at zero for plotting. The dominance score for linear epitopes was defined as the DiscoTope2 score averaged over the length of the linear epitope. Epitopes that were discontinuous due to gaps in the structure were excluded. Conformational epitopes with no corresponding linear epitope were inspected for regions with continuous signal. Dominance and conservation were calculated over these regions if they spanned at least 6 residues. Regions longer than 15 residues with diffuse boundaries were excluded.

SARS-CoV-2 variant analysis

A total of 7,948 high coverage sequences from human isolates were downloaded from GISAID on 04/20/2020, representing 6,797 unique DNA sequences. These were annotated as ‘sequences with < 1% Ns and < 0.05% unique amino acid mutations (not seen in other sequences in the database) and no insertion/deletion unless verified by the submitter’. The unique sequences were aligned to the reference genome with MAFFT experimental version 7.463 using options ‘–auto –addfragments’24,25. The misaligned 3.2 kb fragment EPI_ISL_426413 was identified as Hepatitis B virus isolates JRC-HB01 by MEGABLAST against the NCBI nr database and was removed from the alignment. Further, a total of 40 inserts in the alignment that introduced gaps in the reference sequence were deleted as likely sequencing errors. The 11 largest inserts ranged from 99 to 155 bases, and the next largest were 3 base long. The genome submitter confirmed that those large inserts were likely assembly artefacts that will be corrected. Of the smaller inserts, a three base ‘TTT’ in-frame insertion was observed at nsp6 position 298 in 17 sequences, and is presumably real, although it is in a repetitive region of 8 consecutive Ts and therefore at the limit of accurate PCR amplification of mono-thymidine repeats26. Gaps in aligned sequences were likewise treated as likely sequencing errors and replaced with the unknown base to keep translation in frame with respect to the reference sequence, taking into account the ribosomal frameshift in the nsp12 coding sequence. This process resulted in just 95 remaining premature terminations in over 160 K proteins; 43 of these were found consistently after amino acid 125 of nsp10 and likely represent true truncations. The others were distributed across the sequences of 11 protein families and are likely erroneous. The 27 protein alignments thus obtained, one for each of the SARS-CoV-2 proteins selected for this study, were transformed into a n by m matrix, where n is the length of the reference protein and m the length of the alignment, and processed column-wise. At each position in the alignment, the variant count was obtained by counting unique amino acids that differed from the reference; variability was quantified using the Shannon entropy, calculated as \(SE= -{\sum }_{i=1}^{n}{P}_{i}\times {log}_{2}\left({P}_{i}\right)\) where n is the number of amino-acid types and Pi is the fraction of amino acid type i at that position27,28. Codes for unknown amino acid (X) and translation termination (*) were ignored in all calculations.

Epitope scoring table

Supplementary Materials (Table S4). Epitopes with the highest dominance score and highest specificity were ranked at the top: first the epitopes defined by both the linear and conformational predictions; second, the conformational epitopes where the linear prediction was below the threshold; third, the specific dominant epitope with high variant density. This group was followed by epitopes located on flexible segments, ranked by specificity. The remaining linear epitopes were ranked by specificity, average BepiPred2 score and variant density. At the bottom of the list were the least specific epitopes on flexible segments, followed by those with the highest dominance score.

Protein expression and purification

pET28 vector constructs encoding full length SARS-CoV-2 Nucleoprotein, nsp7, nsp8, nsp9, nsp10, nsp15 and nsp16 were ordered from Genscript. The nucleotide sequences were codon optimized for E.coli expression and in frame with an N terminus 6XHistidine metal affinity tag with a TEV protease cleavage site. Additionally, a pMCSG53 vector construct encoding the N terminus of SARS-CoV-2 Nucleoprotein (47–173), with an N terminal 6X Histidine metal affinity tag and TEV protease cleavage site was gifted from the Center for Structural Genomics of Infectious Diseases (CSGID).

Plasmids were transformed into BL21 (DE3) Rosetta Oxford chemically competent E.coli cells, and selected colonies were re-streaked on LB agar with appropriate antibiotics.

For protein production, starter cultures of PA-0.5G [50 mM Na2HPO4 50 mM KH2PO4 25 mM (NH4)2SO4 2 mM MgSO4 0.2 × trace metals 0.5% glucose 200 µg/mL 18 amino acids] non-inducing media with appropriate antibiotics were grown for 18 h at 25 °C. Antibiotics were added to 2 L bottles of sterile ZYP-5052 [1% tryptone 0.5% yeast extract 50 mM Na2HPO4 50 mM KH2PO4 25 mM (NH4)2SO4 2 mM MgSO4 0.2 × trace metals 0.5% glycerol 0.5% glucose 0.2% lactose] autoinduction media, and the bottles were inoculated with overnight cultures29.

Inoculated bottles were then placed into a LEX bioreactor and cultures grown for 72 h at 20 °C. To harvest, the medium was centrifuged at 6000 RCF for 30 min at 4 °C. Cell paste was frozen and stored at -80 °C prior to purification.

Nsp7, nsp8, nsp9, nsp10, nsp16, nsp15, full-length nucleoprotein and the N-terminal truncation representing the RNA-binding domain (amino acids [AA] 47–173) of nucleoprotein (N-Nt) were all produced in E. coli and complexes assembled by co-lysis of recombinant proteins produced in individual E. coli. For purification, cells were resuspended in lysis buffer [25 mM HEPES (pH 7.0) 500 mM NaCl 5% glycerol 0.5% CHAPS 30 mM imidazole 10 mM MgCl2 1 mM TCEP 1 mM AEBSF 0.025% NaN3] and lysed by sonication. The cell lysate was then cleared by centrifugation at 26,000 g for 45 min at 4 °C. Each recombinant protein was purified from the cell lysate by immobilized metal affinity chromatography. A Ni–NTA column (GE Healthcare, HisTrap FF) was equilibrated in lysis buffer to which the lysate was added. The column was washed with 150 mL of wash buffer supplemented with [25 mM HEPES (pH 7.0) 500 mM NaCl 5% glycerol 30 mM imidazole 1 mM TCEP 0.025% NaN3] and the protein was eluted from the column by the addition of wash buffer supplemented with 350 mM imidazole. In each case, the protein was further purified by size-exclusion chromatography on a GE HiLoad 26/600 Superdex 75 PG or a Superdex 200 PG column previously equilibrated with size-exclusion buffer [25 mM HEPES (pH 7.0) 500 mM NaCl 5% glycerol 2 mM DTT 0.025% NaN3]. Fractions containing the final protein, in a pure state, were pooled and concentrated to 5 mg/mL before being snap frozen in liquid nitrogen and stored at -80 °C.

The prefusion-stabilized trimeric spike (S-RBD) construct was produced at the Institute for Protein Design, Seattle, WA. The expression plasmid was received from Dr. David Veesler, UW Biochemistry, Seattle, WA, with discovery and characterization information as previously described11. The spike protein was expressed using transient transfection in Thermo Expi HEK293 mammalian cells per IPD protocol, harvested after 3 days incubation at 33 °C, and clarified using PDAD (Sigma-Aldrich #409,014). The clarified harvest fluid (CHF) was purified using TaKaRa Talon immobilized metal affinity chromatography (TaKaRaBio 635,504) and buffer exchanged into Tris Buffered Saline at 2–8 °C. The protein was characterized using negative stain Electron Microscopy, SDS-PAGE, and UV–Vis.

Serological assays (ELISA validation)

Recombinant proteins refer to SARS-CoV-2 nsp7/8, nsp9, nsp10/nsp16, spike-RBD, Nucleoprotein full length, N-Nt, and a mix of S-RBD and N-Nt. The recombinant proteins were diluted to 2 µg/mL into sterile 1 × PBS and then used to coat ELISA plates (Immulon 4HBX Flat Bottom Microtiter Plates 96-wells, 50 plates/case, ThermoScientific, Rochester, NY), using 100 µL of 2 µg/mL into each well, covered, and incubated at 4 °C at least overnight (up to seven days). For the combined ELISA, we coated with both S-RBD at 1 µg/mL and N-Nt at 1 µg/mL in 100 µL PBS in each ELISA well. After incubation at 4 °C at least overnight, the plates were washed with a plate washer 2 × with 300 µL per wash (wash solution for this and all washes: PBS with 0.1% Tween20). Then we added 200 µL of 2% BSA (20 mg/mL) in PBS to each well to quench binding, incubating at least 1 h at room temp with shaking. We then washed with a plate washer 2 × at 300 µL per wash. We heat-treated all plasma or sera at 56 °C for 60 min. Tubes were cooled on ice after heat treatment and spun at 10,000 RPM in a 4 °C microfuge for 10 min, saving the supernatant. The supernatant was then diluted to working concentration (1:100) in PBS-0.1% Tween20-2% BSA and spun again at top speed in a 4 °C microfuge (13,000 RPM or 17,000 × g) for 30 min, and the supernatant was recovered for the next step. We then added 50 µL per well of 1:100 diluted plasma or sera, diluted in PBS-0.1% Tween20-2% BSA and then incubated for 90 min at room temp, while shaking. We then washed the plates with a plate washer 3 × 300 µL per wash. We then added 100 µL per well of anti-IgG-HRP (Rockland anti-human-IgG(H + L) goat antibody horseradish peroxidase (HRP) conjugated, Cat# 609–1302) (diluted in same buffer as the sera/plasma), and incubated 1 h at room temp while shaking. After that incubation, we washed with a plate washer 3 × at 300 µL per wash. We then added 100 µL of developing solution (SigmaFAST OPD tablets, Sigma, St. Louis, MO, prepared as per manufacturer), allowed 30 min in the dark at room temperature, and added 25 µL stop solution (3 N HCl). The plates were then covered and read on plate reader at OD 492 nM. Positives were defined as those with OD492 that are greater than the (mean + 3 times standard deviation of mean) of the negative controls. The ELISA was run each time with 10 to 20 control plasma/sera for this calculation.

The samples from those persons that tested positive for SARS-CoV-2 were derived from plasma or sera from adults (≥ 18 yo) of both genders, provided without personal identifiers, from Seattle Children’s SARS2 Recovered Cohort (N = 16) and a convalescent cohort screened for eligibility for plasma donation (N = 57). In addition, seven specimens were residual clinical samples obtained from inpatients at the University of Washington. Forty-two negative control plasma and serum samples were from banked samples collected from healthy male and female adult (> 21 yo) volunteers who had previously participated in PK studies conducted earlier than November 2019 by Dr. Nina Isoherranen’s UW group. A second group of negative control specimens included thirty sera samples from adults participating in University of Washington Virology Research Clinic protocols studying Zoster vaccination earlier than November 2019. A third group of negative controls were 29 serum specimens from the UW Department of Laboratory Medicine that tested negative for COVID-19 in the Abbott Architect SARS-CoV-2 IgG Assay30. All negative control samples were obtained earlier than November 2019. All serum and plasma samples were collected under either the University of Washington or the Seattle Children's IRB approval.

Approval for human experiments statement

All methods were carried out in accordance with relevant guidelines and regulations. All human serum and plasma samples were obtained from donors over 18 yo with informed consent under protocols approved by the University of Washington or Seattle Children’s IRB.