Benchmarking evolutionary tinkering underlying human–viral molecular mimicry shows multiple host pulmonary–arterial peptides mimicked by SARS-CoV-2

The hand of molecular mimicry in shaping SARS-CoV-2 evolution and immune evasion remains to be deciphered. Here, we report 33 distinct 8-mer/9-mer peptides that are identical between SARS-CoV-2 and the human reference proteome. We benchmark this observation against other viral–human 8-mer/9-mer peptide identity, which suggests generally similar extents of molecular mimicry for SARS-CoV-2 and many other human viruses. Interestingly, 20 novel human peptides mimicked by SARS-CoV-2 have not been observed in any previous coronavirus strains (HCoV, SARS-CoV, and MERS). Furthermore, four of the human 8-mer/9-mer peptides mimicked by SARS-CoV-2 map onto HLA-B*40:01, HLA-B*40:02, and HLA-B*35:01 binding peptides from human PAM, ANXA7, PGD, and ALOX5AP proteins. This mimicry of multiple human proteins by SARS-CoV-2 is made salient by single-cell RNA-seq (scRNA-seq) analysis that shows the targeted genes significantly expressed in human lungs and arteries; tissues implicated in COVID-19 pathogenesis. Finally, HLA-A*03 restricted 8-mer peptides are found to be shared broadly by human and coronaviridae helicases in functional hotspots, with potential implications for nucleic acid unwinding upon initial infection. This study presents the first scan of human peptide mimicry by SARS-CoV-2, and via its benchmarking against human–viral mimicry more broadly, presents a computational framework for follow-up studies to assay how evolutionary tinkering may relate to zoonosis and herd immunity.


Introduction
Viral infection typically leads to T-cell stimulation in the host, and autoimmune response associated with viral infection has been observed 1 . SARS-CoV-2, the causative agent of the ongoing COVID-19 pandemic, has complex manifestations ranging from mild symptoms like loss of sense of smell (anosmia) 2 to severe and critical illness 3,4 . While some molecular factors governing SARS-CoV-2 infection of lung tissues, such as the ACE2 receptor expressing cells have been characterized recently 5 , the mechanistic rationale underlying immune evasion and multi-system inflammation (Kawasaki-like disease) remains poorly understood 6,7 .
The SARS-CoV-2 genome encodes 14 structural proteins (e.g., Spike protein) and nonstructural proteins (e.g., RNA-dependent RNA polymerase), as depicted in Fig. 1a. The nonstructural ORF1ab polyprotein undergoes proteolytic processing to give rise to the following proteins: NSP1, NSP2, PL-PRO, NSP4, 3CL-PRO, NSP6, NSP7, NSP8, NSP9, NSP10, RdRp, Hel, ExoN, NendoU, and 2′-O-MT. The human reference proteome consists of 20,350 proteins, which when alternatively spliced, result in over 100,000 protein variants (Fig. 1b) 8 . Here, we investigate the potential for molecular mimicry in the context of immune surveillance and host-antigen recognition in COVID-19, by performing a systematic comparison of known MHC-binding peptides from humans and mapping them onto SARS-CoV-2-derived peptides (see "Methods"). For this, we compute the longest peptides that are identical between SARS-CoV-2 reference proteins and human reference proteins, thus creating a map of COVID-19 host-pathogen molecular mimicry. Based on this resource, we extrapolate the potential for HLA Class-I restricted, T-cell immune stimulation via synthesizing established experimental evidence around each of the mimicked peptides.

Methods
Computing the SARS-CoV-2 peptides that mimic MHC class-I binding peptides on the human reference proteome The reference proteome for SARS-CoV-2 consists of 14,221 8-mers and 14,207 9-mers, that result in 9827 distinct 8-mers and 9814 distinct 9-mers (see Table S1). The 8-mers and 9-mers are generated by using a sliding window, moving one amino acid at a time, resulting in overlapping linear peptides. The reference human proteome, on the other hand, consists of 11,211,806 peptides that are 8-mers and 11,191,459 peptides that are 9-mers, resulting in 10,275,893 unique 8-mers and 10,378,753 unique 9-mers. Including alternatively spliced variants increases the unique peptide counts to 11,215,743 8-mers and 11,342,401 9-mers.
Thirty-one 8-mer peptides and two 9-mer peptides are identical between the reference proteomes of SARS-CoV-2 and humans, after including alternatively spliced variants (Fig. 1b, Table S2). For comparison, we analyzed the protein sequences from 9434 viral species (taxons) from NCBI RefSeq (see "Methods"). On average, around 45.57 unique 8-mer/9-mers per viral taxon were shared with the Fig. 1 Molecular mimicry and immunomodulatory potential. a n-mer peptide generation. b Mimicked peptides between SARS-CoV-2 and human proteomes. c Comparison of human-protein mimicking SARS-CoV-2 peptides with peptides from other human coronaviruses. d Immunomodulatory potential of mimicked peptides from SARS-CoV-2.
human proteome (mean = 45.57, median = 12, SD = ±135.38). In order to control for the complexity and constraints of the amino acid sequences, we also analyzed the distribution of mimicked 8-mers/9-mers normalized by the total number of unique 8-mers/9-mers present in each viral taxon. On average, a fraction of 0.002 8-mers/9mers out of all the unique 8-mers/9-mers in the virus are identical between the viral proteome and the human proteome (mean = 0.002, SD = ±0.011) (Fig. 1, Supplemental Fig. 1). The fraction of 33 human-mimicking 8mers/9-mers is proximal to the mean. Overall, this suggests that the presence of 33 human-mimicking 8-mers/9mers in SARS-CoV-2 is not surprising compared to the number of human-mimicking 8-mers/9-mers present in other viruses.
In SARS-CoV-2, no 10-mer or longer peptides are identical between the pathogen and the host reference proteins. Of these, 29 peptides (8-mer/9-mer) mimicked by SARS-CoV-2 map onto nine of the 14 SARS-CoV-2 proteins and 29 of the 20,350 human proteins. By including alternative splicing, the 33 8mers/9-mers mimicked by SARS-CoV-2 map onto 39 of the 100,566 protein splicing variants. That is, 0.16% of the human proteins and 0.04% of all the known splicing variants have 8-mer/9-mer peptides that are mimicked by the SARS-CoV-2 reference proteome. Given that MHC Class I alleles typically engage peptides that are 8-12 mers 9 , the analysis that follows was restricted to mapping host-pathogen mimicry from an immunologic perspective across 8-mer and 9-mer peptides only.
Comparative analysis of SARS-CoV-2 peptides mimicking the human proteome with the reference SARS, MERS, and seasonal HCoVs Of the 33 peptides from SARS-CoV-2 that mimic the human reference proteome, 20 peptides are not found in any previous human-infecting coronavirus (SARS, MERS, or seasonal HCoVs) ( Table 1). The UniProt database was used to download the 14 protein reference sequences for SARS-CoV. The non-redundant set of protein sequences from other coronavirus strains (HCoV-HKU1:188; HCoV-229E: 246; HCoV-NL63: 330; HCoV-OC43: 910; and MERS: 681) was computed by removing 100% identical sequences, and the remnant sequences were all included in the comparative analysis with SARS-CoV-2 mimicked peptides. A Venn diagram depicting the overlap of mimicked peptides across different human coronaviruses was generated (Fig. 1c).
Given the zoonotic transmission potential of coronaviruses from other organisms to humans, we also considered 8-mer/9-mer peptides derived from 13,431 distinct protein sequences of all non-human coronaviridae from the VIPERdb 10 .
Characterizing the SARS-CoV-2-derived 8-mers/9-mers that mimic established human MHC-binding peptides The lengths of the human proteins mimicked by SARS-CoV-2 were considered to examine any potential bias towards the larger human proteins (Fig. 1, Supplementary  Fig. 1). For instance, human Titin (TTN), despite containing 34,350 amino acids and being of the longest human proteins, does not have even one peptide mimicked by SARS-CoV-2. The longest and shortest human proteins that are mimicked by SARS-CoV-2 are MICAL3 (length = 2002 amino acids) and BRI3 (length = 125 amino acids), respectively.
The sequence conservation of each mimicked peptide was derived from all the 46,513 sequenced SARS-CoV-2 genomes available in the GISAID database (as on 06/13/ 2020).
The immune epitope database (IEDB) 11 was used to examine the experimentally established, in vitro evidence for MHC presentation against human or SARS-CoV antigens. The peptides of potential immunologic interest were identified from the IEDB database using the following pair of assays. One of the assays involved purification of specific MHC-class I alleles and estimating the K d values of specific peptide-MHC complexes through competitive radiolabeled peptide binding 12 . The other assay uses mass spectrometry proteomic profiling of the peptide-MHC complexes, where the MHC complexes were purified from the cell lines specifically engineered to produce mono-allelic MHC class I molecules. The identity of the peptide sequence bound to the class-I MHC molecules was elucidated using mass spectrometry 13 .

Analysis of RNA expression in cells and tissues
A distribution of RNA expression across all the expressing samples collected from GTEx, Gene expression omnibus, TCGA, and CCLE is created. In this distribution, a high-expression group is defined as the set of samples associated with the top 5% of expression level. Enrichment score captures the significance of the token in the high-expression group. The significance is captured based on Fisher's test along with Benjamini-Hochberg correction. During the comparison of gene expression across tissue types in GTEx, the specificity of expression is computed using "Cohen's D", which is an effect size used to indicate the standardized difference between two means.

Analysis of overlapping peptides between the proteome and viral proteomes
We analyzed the protein sequences from 9434 viral species (taxons) from NCBI RefSeq (https://ftp.ncbi.nlm. nih.gov/refseq/release/viral/). On average, around 45.57 unique 8-mer/9-mers per viral taxa were detected to be identical to a known human protein (mean = 45.57, Table 1    The viral-human mimicked 8-mer/9-mer peptides are highlighted in green text. median = 12, SD = ±135.38). The highest number of identical peptides were detected in Pandoravirus dulcis virus with 4423 peptides having an exact identical match to one or more human proteins.

Results
Identifying specific human peptides mimicked by SARS-CoV-2 and with in vitro evidence for MHC binding A set of 20 8-mer/9-mer human peptides are mimicked by SARS-CoV-2 and no other human coronaviruses (see Methods). Of these 20 peptides, four peptides are constituted within established MHCbinding regions (Table 1-panel a). The four peptides with specific MHC-binding potential that are novel to SARS-CoV-2 map onto the following human proteins: alpha-amidating monooxygenase (PAM), annexin A7 (ANXA7), peptidylglycine 6-phosphogluconate dehydrogenase (PGD), and centromere protein I (CENPI) (Fig. 1d). Analyzing the sequence conservation of the SARS-CoV-2-exclusive peptides shared with the above four human MHC-binding peptides, shows that these SARS-CoV-2 peptides are largely conserved till date ( Table 2). The previous human-infecting coronavirus strains (SARS-CoV, MERS, and seasonal HCoVs) are notably bereft of these novel SARS-CoV-2 epitopes. An alternatively spliced variant of the human arachidonate 5-lipoxygenase activating protein (ALOX5AP-ENSP00000479870.1; ENST00000617770.4) contains an 8-mer peptide that is mimicked by SARS-CoV-2 as well as SARS-CoV, but not any of the seasonal HCoVs. This peptide has in vitro evidence for positive MHC class-I binding. In addition, there are four human helicases (MCM8, DNA2, MOV10L1, and ZNFX1), each containing peptides with established evidence of MHC class-I binding that are also mimicked by SARS-CoV-2 and by previous human-infecting coronaviridae strains (Table 1 -panel c; Fig. 1d) ( Table 2).
Novel mimicry of human PAM, ANXA7, and PGD by SARS-CoV-2, suggests an enrichment of mimicked peptides in lung, esophagus, arteries, heart, pancreas, and macrophages Considering bulk RNA-seq data from over 125,000 human samples with non-zero PAM expression shows that PAM is highly expressed in pancreatic islets (enrichment score = 276.9 across 6 studies and 552 samples), artery (enrichment score = 244.5 across 3 studies and 343 samples), heart (enrichment score = 243.6 across 19 studies and 442 samples), aorta (enrichment score = 217.1 across 2 studies and 304 samples), embryonic stem cells (enrichment score = 146.7 across 2 studies and 352 samples), and fibroblasts (enrichment score = 107.7 across 28 studies and 215 samples) (Fig. 2a). Among 54 human tissues from GTEx, PAM is particularly significant within aortic arteries (n = 432, Cohen's D = 3.1, mean = 348.2 TPM) compared to other human tissues. Moderate specificity in gene expression is also noted for the atrial appendage of the heart (n = 429, Cohen's D = 2.2, mean = 461.3 TPM) (Fig. 2b). Further, immunohistochemistry (IHC) based data on 45 human tissues 14 shows that the PAM protein is detected at high levels in heart muscles, epididymis, and the adrenal gland (Fig. 2b).
Exploring all available human single-cell RNA-seq (scRNA-seq) data shows PAM is expressed in nearly 100% of pancreatic gamma cells, alpha cells, beta cells, delta cells, and epsilon cells as well as between 50 and 90% of activated/quiescent stellate cells, acinar cells, endothelial cells, ductal cells of the pancreas. It is also expressed in over 80% of cardiomyocytes, and 40-70% of heart fibroblasts, macrophages, endothelial cells, and smooth muscle cells, as well as in 26-27% of lung pleura fibroblasts, stromal cells, and neutrophils (Fig. 2c). Moreover, analyzing the scRNA-seq data from severe COVID-19 patient's lung bronchoalveolar lavage fluid shows high PAM expression in club cells (Fig. 2d), which intriguingly also express the SARS-CoV-2 receptor ACE2 significantly 5 . Furthermore, esophagus scRNA-seq analysis shows esophageal mucosal cells and stromal cells as significant PAM expressors (Fig. 2e). Finally, rarer cell types such as pulmonary neuroendocrine cells and goblet cells of the lungs, and some common cell types like lung serous cells and respiratory secretory cells also express PAM significantly.
Similar to the expression profile of PAM, examining 130,400 human samples with nonzero ANXA7 expression shows that ANXA7 is highly expressed in pancreatic islets (enrichment score = 286.65; 543 samples; 3 studies) and artery (enrichment score = 161.68; 184 samples; 3 studies) ( Fig. 3-Supplementary Fig. 1). ANXA7 is significantly expressed in the aortic artery (n = 432, Cohen's D = 2.1, mean = 163.1 TPM) and the tibial artery (n = 663, Cohen's D = 2.6, mean = 176.4 TPM) (Fig. 3-Supplementary Fig. 2). Analysis of the scRNA-seq data on ANXA7 confirms expression in endothelial cells across multiple tissues and organs, and also indicates expression in lung type-2 pneumocytes, macrophages, oligodendrocytes (Fig. 3a). Type-2 pneumocytes are noted to express the SARS-CoV-2 receptor ACE2 from scRNAseq 5 . Analyzing the lung bronchoalveolar lavage fluid scRNA-seq data from patients with severe COVID-19 outcomes shows macrophages, lung epithelial cells, Tcells, club cells, proliferating cells, and plasma cells are significant expressors of ANXA7 (Fig. 3b). Additionally from the study of normal lungs, activated dendritic cells and lymphatic vessel cells are noted to express ANXA7 significantly (Fig. 3c).  Assessment of around 128,000 human samples shows that PGD is highly expressed in esophagus mucosa (enrichment = 323, n = 510, 2 studies), blood (enrichment = 320.5, n = 1020, 39 studies), and macrophages (enrichment = 141.6, n = 202, 4 studies). (Fig. 4, Supplementary   Fig. 1). IHC data on 45 human tissues from the Human Protein Atlas 14 confirms that PGD is detected at high levels in the esophagus (Fig. 4, Supplementary Fig. 2), and additionally in the testes, tonsils, bone marrow, gallbladder, spleen, and placenta. Unlike the expression profiles of PAM, ANXA7, and PGD, CENPI's expression is fairly non-specific and relatively negligible from available data sets. Mild-to-moderate expression of CENPI is seen in precursor B cells and late erythroid cells, but further studies are needed to ascertain the significance, if any, of CENPI expression, including in the context of COVID-19. Multi-pronged mimicry of PAM, ANXA7, and PGD by SARS-CoV-2 and its potential for factoring into the pulmonary-arterial autoinflammation seen in severe COVID-19 patients Positive HLA-B*40:02 binding has been established for the human PAM peptide ("KEPGSGVPVVL") and the ANXA7 peptide ("VESGLKTIL") [15][16][17] , that contain the distinctive mimicking SARS-CoV-2 peptides ( Table 1). The closely related HLA-B*40:01 allele also binds this mimicked ANXA7 peptide 17 . The corresponding mimicking peptides of PAM and ANXA7 are from the viral RNA-dependent RNA polymerase and NSP2 protein, respectively, which are constituted within the MHCbinding regions (highlighted above in bold text). In addition, HLA-B*35:01 has experimental evidence for positive binding of the human PGD peptide mimicked by the SARS-CoV-2 virus (Table 1).
Given the high expression of ANXA7, PGD, and PAM among cells of the respiratory tract, lungs, arteries, cardiovascular system, and pancreas, as well as in macrophages, their striking mimicry by SARS-CoV-2 raises the possibility of individuals with HLA-B*40 and HLA-B*35 alleles being predisposed to potential immune evasion or autoinflammation. Indeed, the potential for broad vascular/ endothelial autoinflammation is consistent with the rarer multi-system inflammatory syndrome or atypical Kawasaki disease noted in few COVID-19-infected children 18,19 .
Alternatively spliced human protein variants analysis for mimicry by SARS-CoV-2 highlights another HLA-B*40:01 restricted protein (ALOX5AP) with autoimmune potential A splicing variant of ALOX5AP (ENSP00000479870.1 and ENST00000617770.4) containing the 8-mer peptide "PEANMDQE" is one of four alternatively spliced human protein variants that are mimicked by SARS-CoV-2. The other three 8-mer peptides arising from splicing variants do not have any known class I MHC binding reported in the immune epitope database. However, SARS-CoV, which is the only other human-infecting coronavirus in addition to SARS-CoV-2 that also contains PEANMDQE, has been experimentally established to possess the PEANMDQESF epitope that positively binds to the HLA-B*40:01 allele.
ALOX5AP is known from literature knowledge synthesis to be associated with ischemic stroke, myocardial infarction, atherosclerosis, cerebral infarction, and coronary artery disease (Fig. 5a). Single-cell RNA-seq studies show numerous types of macrophages expressing ALOX5AP, including in the lungs and brain temporal lobe. Epithelial cells and proliferating cells of the lungs also express ALOX5AP, as do other types of immune cells such as Tcells, neutrophils, and dendritic cells (Fig. 5b). Taken together with the HLA-B*40:01 restricted binding of the PAM and ANXA7 peptides mimicked by SARS-CoV-2, the ALOX5AP splicing variant also mimicked by SARS-CoV-2 suggests the possibility of immune evasion or autoinflammation in HLA-B*40-constrained COVID-19 patients.
Mimicked HLA-A*03-binding peptides are shared between HCoV helicases and the human helicases There are seven human protein mimicking peptides that are shared between SARS-CoV-2 and at least one other human-infecting coronavirus (Table 1c). These proteins include DNA2, MCM8, MOV10L1, ZNFX1, which are all helicases. Analysis of single-cell RNAseq suggests that the mimicked human proteins are expressed in various celltypes including neuronal cells and immune cells (Fig. 6). The HLA-A*03:01 allele has been established from in vitro experiments to bind SARS-CoV helicase peptides that mimic 8-mer peptides from human MOV10L1, DNA2, ZNFX1, and MCM8 helicases (summarized in Table 1) 20 . The HLA-A*31:01 and HLA-A*11:01 alleles, on the other hand, are known to bind peptides containing the human MOV10L1, DNA2, and ZNFX1 helicase mimics; whereas the HLA-A*68:01 allele has established in vitro evidence of binding peptides containing the human DNA2 and MOV10L1 helicase mimics 21 . In some of the individuals carrying these HLA alleles (Table 1c), a positive T-cell response against their "self" cells that express and display the above coronavirus-mimicked peptides seems plausible.

Discussion
The presence of identical peptides between viruses and humans has at least two potential implications from an immunologic standpoint. On the one hand, upon presentation of the viral antigens on the surface of infected cells, the virus may evade immune response by masquerading as a host peptide and the recognition of the shared peptides by host regulatory T cells could promote a generally immunosuppressive environment. On the other hand, an autoimmune response can lead to virus-induced autoinflammatory conditions 22 . Either response requires the coupling of both the presence of the appropriate HLA allele and positive T-cell response toward the mimicked peptide epitopes 23 . It is possible that SARS-CoV-2 leverages one or both of these molecular mimicry strategies to exploit the host immune system. In a small minority of patients who happen to have the unfortunate combination of MHC restriction and T-cell receptors as mentioned above, the specific tissues and cell types harboring the mimicked protein would bear the brunt of sustained autoimmune damage. The autoimmune lung and vascular damage reported in severe COVID-19 patient mortality [24][25][26] necessitates hypothesis-free examination of both these mimicry strategies.
Our study suggests HLA binding of peptides based on existing literature, but existing literature is by no means exhaustive for identifying HLA binding 11 . There is no known HLA class-I mediated positive T-cell response against certain 8-mers documented in the immune epitope database. For example, GPPGTGKS peptide is shared by the viral helicase and human VPS4A, VPS4B, and SETX. This peptide is also shared with seasonal human coronaviruses (HCoV-OC43 and HCoV-HKU1) and previous SARS strains (SARS-CoV and MERS). Further experiments are required to assess any potential for autoinflammation.  5 Evidence for ALOX5AP from biomedical knowledge synthesis and single-cell RNA-seq. a Knowledge synthesis suggests involvement of ALOX5AP in ischemic stroke, myocardial infarction, atherosclerosis, cerebral infarction, and coronary artery disease. b scRNA-seq shows significant expression of ALOX5AP in proliferating cells, macrophages, T-cells, and epithelial cells from the lungs and macrophages of the brain.