Introduction

In late 2019, cases of severe pneumonia with unknown etiology were reported in Wuhan, China. The novel coronavirus, SARS-CoV-2, was identified as the causative agent for this disease, called coronavirus disease 2019 (COVID-19). COVID-19 was declared a pandemic in March 20201; according to the World Health Organization (WHO) dashboard (https://covid19.who.int/), as of June 8, 2022, there have been 530,896,347 confirmed cases and 6,301,020 deaths due to COVID-19 globally.

COVID-19 has three consecutive stages of increasing severity2. In the early stage, flu-like symptoms appear, followed by viral pneumonia. The second stage is characterized by pulmonary inflammation and coagulopathy with increased levels of inflammatory biomarkers. The third stage of the disease is associated with fibrosis. Disease severity and mortality are associated with higher levels of inflammatory markers and increased serum levels of inflammatory cytokines3. Moreover, numerous clinical symptoms and pathologies have been reported in individual COVID-19 patients. Data also suggest that the causative virus, SARS-CoV-2, could act as a trigger for the development of rapid autoimmune responses4,5,6. For example, Guillain-Barré syndrome, an immune-mediated disorder where the cross-reactivity of anti-pathogen antibodies with host proteins plays an important role, has been associated with COVID-197,8.

Following an infection, the physiological role of the immune system is to identify and eliminate the pathogen. However, pathogenic infections have also been associated with autoimmunity wherein aberrant immune responses are elicited against host proteins. Such immune responses may be linked to numerous human diseases, e.g., diabetes mellitus type 1, systemic lupus erythematosus, celiac disease, Henoch-Schönlein purpura, sarcoidosis, Graves’ disease and idiopathic thrombocytopenic purpura9,10. One mechanism that may contribute to autoimmunity involves pathogen-derived antigens that are similar to host antigens but differ enough to induce an immune response11.

Several computational studies12,13 have sought to identify homologous regions between pathogen-derived proteins and human proteins. However, these methods, which are based on sequence homology, cannot capture structural homologies. Here, we present a novel strategy for comparing the surface structure of individual chains between two different proteins.

A critical mass of biomedical information associating COVID-19 infections with autoimmune diseases has emerged14,15,16 [for reviews see17,18,19]. Additionally, a study20 has reported the result of a high-throughput assay to detect autoantibodies in 194 SARS-CoV-2 infected COVID-19 patients. The study found an increase in autoantibodies in the COVID-19 patients compared to uninfected controls. This report provides a comprehensive survey of clinical pathologies associated with SARS-CoV-2 infection and the human proteins that could be associated with these pathologies. Additionally, we describe a novel computational tool to compare the 3D structures of SARS-CoV-2 proteins and human proteins. We list 102 human genes with high structural homology to SARS-CoV-2 proteins. We do not claim that these 102 human genes are necessarily linked to human disease. These data sets are “hypothesis generating” and constitute a useful resource for scientists and clinicians.

Materials and methods

Obtaining PDB files

PDB files were obtained from the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB)21, an open access repository of protein structural information. The data set comprised 22,867 protein structure files. These PDB files were split into two subsets of files, 22,556 human protein files and 26 SARS-CoV-2 protein files. These structures are not all unique proteins; the complete list of all PDB accession codes for human and SARS-CoV-2 proteins used in this study is provided in Supplementary Tables 1 and 2, respectively. The proteins were split into macromolecular chains to prepare the data for input into the ProBiS22 algorithm.
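The chain-splitting step can be illustrated with a minimal sketch. It assumes standard fixed-column PDB text, in which the chain identifier sits in column 22 of each coordinate record. The function name `split_pdb_by_chain`, the choice to keep only ATOM records, and the choice to read only the first model are illustrative assumptions, not a description of the pipeline actually used.

```python
from collections import defaultdict


def split_pdb_by_chain(pdb_text):
    """Group the ATOM records of a PDB-format string by chain identifier.

    Hypothetical helper: keeps only ATOM records and stops at the first
    ENDMDL record, so multi-model files contribute one model only.
    """
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break
        # In the PDB fixed-column format the chain ID is column 22 (index 21).
        if line.startswith("ATOM") and len(line) > 21:
            chains[line[21]].append(line)
    return dict(chains)


example = (
    "ATOM      1  N   MET A   1      11.104  13.207   2.100  1.00 20.00           N\n"
    "ATOM      2  CA  MET A   1      12.560  13.300   2.300  1.00 20.00           C\n"
    "ATOM      3  N   GLY B   1      21.000  23.000   5.000  1.00 20.00           N\n"
)
chains = split_pdb_by_chain(example)
```

Each value in the returned dictionary is the list of coordinate lines for one chain, which could then be written to its own file for a chain-level comparison.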

Generating surface areas

For each protein/chain, solvent-accessible surface atoms are calculated23. The surface calculation identifies vertices based on physicochemical properties and is done individually for each protein chain within the ProBiS algorithm. These surface calculations are pre-computed and stored locally to speed up the later computations.

Aligning proteins

The ProBiS algorithm searches for structurally similar sites on a local scale by finding all matches to each query protein chain within the proteome list supplied. The alignment of two protein chains, one from the query protein and one from the human proteome, is based on finding similar regions between the two chains.

The benefit of the ProBiS algorithm to this exercise is that it is focused on local alignments. The algorithm will begin with vertices relating to three amino acid residue surface regions and expand outwards along the backbone.

Drawing protein structures

Graphical representations of protein structures were created using PyMOL (v2.3.5)24.

Quantification and statistical analysis

Each match between a query chain and a proteome chain is scored in four ways:

  • Surface vectors angle: A vector orthogonal to the geometric mean of the surface of each protein at the aligned area is calculated. The angle between these two vectors is calculated. If the angle is less than 90° (1.571 rad), the alignment will be retained in the results.

  • Surface patch root mean squared distance (RMSD): The distance between each set of vertices in the match is calculated, and the RMSD is calculated from these distances using the following formulas.

    $${\text{Distance}} = \sqrt {\left( {x_{1} - x_{2} } \right)^{2} + \left( {y_{1} - y_{2} } \right)^{2} + \left( {z_{1} - z_{2} } \right)^{2} }$$
    $${\text{RMSD}} = \sqrt {\frac{1}{n }\mathop \sum \limits_{i = 1}^{n} \left( {Distance_{i} } \right)^{2} }$$

    where: Each vertex exists in three-dimensional space and has x, y, and z coordinates. There are n sets of vertices.

  • Surface patch size: The algorithm requires that alignments contain at least 10 vertices.

  • E-values: E-values for a particular alignment are calculated using the Karlin-Altschul equation25, in a similar fashion to evaluating the quality of matches in a sequence alignment:

    $$E = kmn{e}^{-\lambda S}$$

    where: E is the Expect-value of the alignment (E-value); k and λ are constants (k = 0.134 and λ = 0.3176, the values often used for an ungapped alignment in a structural homology search); m is the number of vertices in the query (SARS-CoV-2) alignment fragment; n is the number of vertices in the library (human protein); S is the substitution score, calculated as the sum of the scores for each substitution using a BLOSUM62 matrix.
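The scores defined above can be sketched in a few lines. This is an illustration of the stated formulas only, not the ProBiS implementation: the function names are hypothetical, the inputs are assumed to be plain (x, y, z) tuples, and only the two thresholds given explicitly in the text (angle below 90°, at least 10 vertices) are encoded.

```python
import math

K = 0.134        # Karlin-Altschul constant k quoted in the text
LAMBDA = 0.3176  # Karlin-Altschul constant lambda quoted in the text


def angle_between(u, v):
    """Angle in radians between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


def rmsd(query_vertices, library_vertices):
    """Root mean squared distance over n paired (x, y, z) vertices."""
    pairs = list(zip(query_vertices, library_vertices))
    squared = [sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in pairs]
    return math.sqrt(sum(squared) / len(squared))


def e_value(m, n, score, k=K, lam=LAMBDA):
    """Karlin-Altschul expect value E = k * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def passes_geometric_filters(angle_rad, n_vertices):
    """Apply the two thresholds stated in the text (hypothetical helper)."""
    return angle_rad < math.pi / 2 and n_vertices >= 10
```

Because E falls exponentially with the substitution score S and grows linearly with the fragment sizes m and n, a larger search space needs a higher score to reach the same E-value.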

GO term and KEGG pathway analysis

Both analyses were performed in R26 and calculated using the topGO27 package. The universe of genes for both analyses was the list of proteins for which a PDB file was available rather than the entire human proteome. All p-values were adjusted using the Benjamini–Hochberg method28 as implemented in R. A threshold of 0.05 was used as a cut-off for significance using adjusted p-values.
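To make the enrichment logic concrete, the following Python sketch shows an over-representation test against a restricted gene universe, followed by Benjamini–Hochberg adjustment. The analysis itself was run in R with topGO; the choice of a one-sided hypergeometric test here is an assumption, since the text does not state which test statistic was used, and both function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(hits, annotated, drawn, universe):
    """P(X >= hits) when `drawn` genes are sampled without replacement from
    a universe of `universe` genes, `annotated` of which carry the term.

    Assumed test: the source does not specify the statistic used.
    """
    upper = min(annotated, drawn)
    total = sum(
        comb(annotated, i) * comb(universe - annotated, drawn - i)
        for i in range(hits, upper + 1)
    )
    return total / comb(universe, drawn)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Restricting `universe` to proteins with an available PDB file, as described above, avoids counting a term as enriched merely because well-studied proteins are more likely to have solved structures.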

Data visualizations

All Figures were created using ggplot229.

Data and Code availability

A list of PDB IDs and chain IDs for both human and SARS-CoV-2 proteins has been provided in the supplementary data along with all code used to generate the alignments (Supplementary Tables 1 and 2 and Supplementary File-Codes-1). The URLs for the tools and databases used in this study are provided in Supplementary Table 3.

Results

The overall approach and selection of structures for determining cross-reactivity of anti-SARS-CoV-2 antibodies with endogenous human proteins

The overarching goal of this theoretical study is to identify endogenous human proteins that could plausibly be targeted by antibodies that are elicited against a pathogen. Previous computational strategies have used sequence homology to identify human proteins that may be similar to viral antigens12,13,30,31. Here we have adopted a novel strategy that seeks to identify structural homologies using SARS-CoV-2 as a model system. The workflow of our strategy to identify conformational homology between SARS-CoV-2 proteins and endogenous human proteins is illustrated in Fig. 1. An important limitation of the method we adopted is that the available protein structures represent only approximately 35% of the human proteome, despite large increases in publicly available datasets in recent years. However, researchers tend to solve the structures of proteins deemed biologically or clinically important. Thus, the available human protein structures are likely to represent those most likely associated with clinical indications.

Figure 1

Structural homology identification workflow. A description of the steps in our workflow for identifying homologous protein structures. First, obtain PDB files for human and SARS-CoV-2 Wuhan strain proteins. Second, filter out redundant and non-endogenous human proteins. Third, split all proteins into macromolecular chains. Fourth, find matches between SARS-CoV-2 chains and human chains. Fifth, identify which chains found in step 4 belong to proteins that could have clinical indications related to COVID-19.

Identification of human proteins with structural homology to SARS-CoV-2 proteins

A detailed description of the novel computational workflow we used to identify endogenous human proteins with structural homology to SARS-CoV-2 proteins has been provided in the “Methods” section. The human and SARS-CoV-2 proteins used in this analysis are listed in the Supplementary Tables 1 and 2 respectively. Identifying structural homologies relies on parsing databases of known protein structural files down to non-redundant lists of host organism proteins, splitting those proteins into distinct macro-molecular chains, generating surface structures for each chain, and matching endogenous human chains to SARS-CoV-2 protein chains.

For this study, homology between a SARS-CoV-2 protein and an endogenous human protein was assessed using four criteria: (i) the angles of the orthogonal vectors of the surface patches, (ii) the root mean squared distances between each set of vertices, (iii) the size of the surface patches, and (iv) the e-value of the match. The rationale for these criteria was built into the software implementation of the ProBiS algorithm22,23. Using these criteria, we identified 346 human proteins with structural homology to SARS-CoV-2 proteins. These proteins are listed in Supplementary Table 4. A full record of the matches, including alignment lengths, chains and scores, is provided in Supplementary Table 5. An illustrative example is provided in Fig. 2, showing the structural alignment between SARS-CoV-2 spike protein chain A (Protein Data Bank (PDB) ID: 6LXT) and the human protein complement factor B (PDB ID: 1RTK).

Figure 2

Surface matches between SARS-CoV-2 and complement factor B. We have included both a cartoon diagram (A) and a surface diagram (B) to show one of the human proteins (Complement Factor B, PDB ID: 1RTK, chain A, cyan) with structural homology to a SARS-CoV-2 protein (Spike Protein S2 subunit, PDB ID: 6LXT, chain A, green). This match (yellow) satisfied all of our criteria for inclusion as a significant match, with an RMSD of 0.2 Å, an alignment score of 6.7, 15 aligned vertices, a surface vectors angle (SVA) value of 0.3, and an e-value of 1.43 × 10⁻⁵.

Clinical indications associated with SARS-CoV-2 infections

A list of COVID-19-related clinical pathologies was compiled following a review of the medical literature. COVID-19-related clinical pathologies were categorized as general (involving multiple organ systems), as involving the immune system (immunopathology), or as specific to an individual organ system, including lungs (pneumopathy), heart (cardiopathy), blood and clotting (hemopathy and coagulopathy), liver (hepatopathy), kidney (nephropathy), gastrointestinal tract, or brain (neuropathy). In Table 1 we provide a comprehensive list of the clinical indications and symptoms associated with SARS-CoV-2 infections and provide a numerical code for each indication/symptom. The numerical codes were used for easy cross-referencing to human proteins that could potentially be (i) associated with COVID-19-related clinical pathologies based on their known function and (ii) targeted by anti-SARS-CoV-2 antibodies (see below).

Table 1 Clinical indications and symptoms associated with SARS-CoV-2 infections.

Human structural homologues of SARS-CoV-2 proteins and clinical pathologies associated with COVID-19

We also assessed which of the 346 human proteins found to have structural homology to SARS-CoV-2 could be associated with reported COVID-19 pathologies. General literature search strategies included the gene name in addition to key terms such as coronavirus, molecular mimicry, autoimmunity, pathology, knock-out/down in some combination within the PubMed search engine. In some instances, no hits were available to provide evidence of connection to COVID-19 pathologies. However, theoretical associations were hypothesized based on protein localization and function. Protein candidates were tiered. Proteins expressed on the plasma membrane surface or secreted into the extracellular space were considered to have the highest probability of being bound and inhibited by a cross-reactive anti-SARS-CoV-2 antibody. Based on this strategy, of the 346 human proteins that are structural homologs of SARS-CoV-2 proteins, 102 were identified as having biological functions that could be associated with COVID-19 pathologies or symptoms upon inhibition by cross-reactive antibody responses. We have tabulated and encoded clinical indications associated with COVID-19. The COVID-19-related clinical indications potentially associated with each of these 102 human proteins are shown in Table 2.

Table 2 Human proteins with both structural homology to SARS-CoV-2 proteins and associations with clinical indications.

The biology of human structural homologues of SARS-CoV-2

We have identified the Gene Ontology (GO) Terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways enriched by the list of 346 structurally homologous proteins (Fig. 3)27,32,33. Based on literature searches we have listed the biological functions of the 102 human proteins which show structural homology with SARS-CoV-2 proteins. These proteins exhibit diverse biological functions and could thus potentially affect multiple biochemical pathways (Table 2, Fig. 4).

Figure 3

Biological processes related to predicted homologous proteins. Over enriched GO terms (A) and over enriched KEGG Pathways (B) are shown here. These show which terms and pathways are overrepresented in our set of 346 proteins which have significant structural homology to a SARS-CoV-2 protein. Terms with adjusted p-values less than 0.05 are shown here. The size of each dot shows the number of proteins represented in our data set which fall in that category. The color of the dot shows the adjusted p-value of the enrichment.

Figure 4

An overview of clinical indications. The distribution of top-level clinical indications shows that proteins with structural homologies span a wide range of clinical indications.

Discussion

Immunodominant antigens introduced by pathogens have the potential to trigger cross-reactive immune responses that may impact the function of bystander endogenous host proteins with shared structural epitopes and result in an autoimmune disease. Some examples of cross-reactive immune responses following viral infection include type 1 diabetes mellitus [coxsackievirus; cytomegalovirus; enterovirus], multiple sclerosis [cytomegalovirus; Epstein Barr virus; measles virus; Theiler’s virus; Varicella-Zoster virus; West Nile virus], immune thrombocytopenia [Hepatitis C virus], myasthenia gravis [West Nile virus], Guillain-Barré syndrome [Zika virus], and rheumatoid arthritis [cytomegalovirus; Epstein-Barr virus]34,35,36.

Since the first COVID-19 cases were identified, a diverse array of clinical indications has been reported to be associated with SARS-CoV-2 infections (Table 1). Although SARS-CoV-2 is primarily a respiratory pathogen, multiple organ systems can be involved. Thus, hematological, cardiovascular, neurological and gastrointestinal complications have been reported in COVID-19 patients37,38,39,40,41,42,43,44. Many of these pathologies are difficult to explain on the basis of the route of entry and/or sites of SARS-CoV-2 infection. In this theoretical study we have explored the hypothesis that cross-reactivity of antibodies that target the SARS-CoV-2 proteins to endogenous human proteins may play a role in at least some of the dizzying array of clinical presentations of COVID-19 patients. Clinical data suggest that this is a plausible hypothesis to investigate. A prospective study involving 22 German patients suggests that SARS-CoV-2 infection could elicit organ-specific autoimmunity in susceptible patients and lead to respiratory failure45. Similarly, a retrospective study of 21 patients with critical SARS-CoV-2 pneumonia detected autoantibodies related to autoimmune disease17. In one study, a high-throughput autoantibody detection technique was applied to 194 SARS-CoV-2-positive subjects20. These subjects showed higher levels of autoantibodies to diverse human antigens compared to controls who were SARS-CoV-2 negative.

Several in silico approaches have been used to identify and study potential cross-reactivity between pathogen-derived and human proteins12,13. These studies primarily rely on identifying amino acid sequence homologies between proteins from the pathogen and endogenous human proteins. In subsequent analysis, some of the studies have endeavored to determine which (if any) of these homologous regions are potential T cell epitopes based on their affinities for HLA class I and class II alleles30,31. We, on the other hand, reasoned that anti-SARS-CoV-2 antibodies that cross-react with endogenous human proteins and elicit autoimmune pathologies are more likely to interact with conformational domains. However, using computational tools to compare conformational homologies between host and pathogen proteins is far more challenging than carrying out a sequence homology search. We have limited our search to proteins which have a known structure and PDB file and used the ProBiS algorithm22,23 to match surface patches on chains of human proteins to surface patches on SARS-CoV-2 proteins [See “Materials and methods”]. Using this method, we have identified 346 human proteins that show structural homology to SARS-CoV-2 viral antigens (Supplementary Table 4).

We concomitantly identified proteins that are linked to clinical conditions or symptoms reported for COVID-19. We carried out an in-depth literature survey to record and classify many clinical manifestations reported in patients diagnosed with COVID-19 (Table 1). We also used a numeric notation for each clinical condition to allow cross-referencing. Our analysis shows that of the 346 human proteins that showed structural homology to SARS-CoV-2, 102 proteins have biological functions which, if disrupted, could result in pathologies associated with COVID-19 (the pathologies and proteins are depicted in Tables 1 and 2). We again emphasize that these 102 human genes have not been experimentally verified but provide a data set which could provide a useful resource. This list could be a starting point for carrying out in vitro studies to elucidate the mechanistic basis of clinical observations.

We have identified human proteins with structural homology to SARS-CoV-2 proteins that, if functionally inhibited (e.g., by cross-reactive anti-SARS-CoV-2 antibodies), may be mechanistically implicated in the development of severe COVID-19 clinical manifestations. An exhaustive discussion of each candidate human protein and its possible implication in COVID-19 clinical presentation is not possible. However, we discuss several examples of identified candidate proteins and their possible connection to severe COVID-19 illness to illustrate the potential practical utility of our theoretical study and resultant data sets. Some genes that may be related to severe COVID-19 pathophysiology and are good candidates for experimental investigation include PRKG1, ACE, CFB, CRP, CTNNB1, EGFR, and VEGFA.

The PRKG1 gene product, protein kinase G1 (PKG-1), is a serine/threonine-specific protein kinase that is activated by cyclic guanosine monophosphate (cGMP) 2. PKG-1 regulates vascular smooth muscle relaxation and modulates the contractility, growth, and apoptosis of cardiomyocytes46,47,48,49. Involvement of the cGMP-PKG signaling pathway in cardiac contractility makes the PRKG1 gene a possible candidate for involvement in heart failure in COVID-19 patients.

The renin-angiotensin system (RAS) is known for its effects on the cardiovascular system and fluid homeostasis50. Increased activity of the vasoconstrictive and proliferative axis, angiotensin II/angiotensin-converting enzyme (ACE)/AT1, has been reported to be associated with a higher risk of acute thrombosis through destabilization of atherosclerotic plaque and enhancement of platelet activity and coagulation51. ACE2 shares 40% identity and 61% similarity with ACE52. SARS-CoV-2 infection mediated by the ACE2 and TMPRSS2 proteins is well established53. ACE2 is expressed in cells from multiple tissues, including airways, cornea, esophagus, ileum, colon, liver, gallbladder, heart, kidney and testis54. As with SARS-CoV55, infection with SARS-CoV-2 may downregulate cell surface expression of ACE2 and may result in reduced activity of ACE2 in infected organs. Moreover, binding of ACE2 to SARS-CoV, and most likely to SARS-CoV-2, increases the activity of disintegrin and metalloproteinase domain-containing protein 17 (ADAM17)57, which can induce shedding of the ACE2 ectodomain and produce detectable soluble ACE258. The shedding of myocardial ACE2 into the circulation and its association with heart disease in preclinical models suggest that the loss of tissue ACE2 plays a pathogenic role in heart disease59,60. Varying ACE2 expression might affect disease susceptibility and progression. Generally, ACE2 expression is highest in children, young people, and women, decreases with age, and is lowest in people with underlying conditions such as diabetes and hypertension. Therefore, lower levels of expression of the viral receptor ACE2 are found in those at the highest risk for progression of COVID-19 to a severe disease phenotype61.

The liver is the major site of complement synthesis. Complement factor B, generally called Factor B, is a protein encoded by the CFB gene. It plays a role in the alternative pathway analogous to that of C2 in the classical pathway. Factor B binds to C3b and is activated to form a proteolytic enzyme that cleaves C3. COVID-19-associated thrombosis and overactivation of the complement cascade have recently been systematically reviewed56. A preprint by Gao et al.62 reported that the SARS-CoV, MERS-CoV and SARS-CoV-2 nucleocapsid (N) proteins bind to MBL-associated serine protease-2 via the lectin pathway of complement activation, resulting in aberrant complement activation and aggravated inflammatory lung injury63.

C-reactive protein (CRP) is a normal plasma protein whose level rises during the cytokine-mediated response to most forms of tissue injury, infection and inflammation; serum CRP values are widely measured in clinical practice as an objective index of disease activity64. The upregulation of CRP that has been reported in COVID-19 patients might be an indication of excessive inflammatory stress and contribute to severe illness or even death65,66,67. Moreover, it has been shown that elevated CRP levels in COVID-19 patients are strongly associated with venous thromboembolism, acute kidney injury, critical illness, and mortality68.

Type 1 interferon production is impaired in severe COVID-19 patients, and this impairment leads to acute respiratory distress syndrome (ARDS) and coagulopathy. Matsuyama et al. reviewed COVID-19 pathophysiology with respect to the NSP1 and ORF6 proteins, which act via induction of signal transducer and activator of transcription 1 (STAT1) dysfunction and compensatory hyperactivation of STAT369. IFN signaling was inhibited by upregulated EGFR and activated STAT370. This review also emphasized the link between STAT3 and coagulopathy, with the production of tissue factor induced by CRP, which may have been activated by STAT3 and may prime the initial phase of coagulation.

Catenin beta-1 is also known as β-catenin. Activation of β-catenin, the primary mediator of the ubiquitous Wnt signaling pathway, alters the immune system in lasting and harmful ways71. It has been demonstrated that the activation of Wnt/β-catenin signaling enhances influenza virus replication72. Wnt signaling is a complex set of signal transduction pathways mediated by multiple signaling molecules. These molecules are involved in many disease conditions73. Specifically, the Wnt family genes FZD4, FZD5 and CTNNB1 and the downstream targets CCND1, VEGFA and AXIN2 were upregulated in end-stage pulmonary arterial hypertension, a life-threatening disease associated with increased pulmonary pressures and the subsequent development of right-sided heart failure73.

This study has limitations. Most importantly, we have used a computational method to compare the conformations of human proteins with SARS-CoV-2 proteins. The underlying postulate is that shared structural homology would result in cross-reactivity. We do not, however, have a direct computational measure of cross-reactivity. Additionally, we rely on conformational similarities and do not weight our scores for protein conformers that may be inaccessible to antibodies. Another limitation is that the available human protein structures represent approximately 35% of the human proteome. Similarly, SARS-CoV-2 variants of concern are important, but we have kept this study focused on the wild type (Wuhan strain) SARS-CoV-2 for two reasons: (1) Essentially there have been no major changes in COVID-19-associated disorders/pathologies with the emergence of the new variants. (2) In the absence of reliable literature on pathologies associated with individual variants, the data would be almost impossible to interpret. The final limitation of this study is that although we list 102 human genes with high structural homology to SARS-CoV-2 proteins, these have not been experimentally validated. Hence, we do not claim that these are linked to human disease. Overall, the datasets generated here are “hypothesis generating” and provide a useful resource.

Evidence has emerged that SARS-CoV-2 infections are associated with autoantibodies and that these have the potential to elicit autoimmune pathologies. We have developed a novel computational approach to identify human proteins that have conformational features similar to SARS-CoV-2 proteins. Thus, there is a likelihood that these human proteins could be targeted by anti-SARS-CoV-2 antibodies. This method and list of human proteins are a resource that can be utilized to study the phenomenon of autoimmune pathologies associated with COVID-19.

Conclusions

In this theoretical study we have identified multiple human proteins with strong structural homology to SARS-CoV-2 proteins. Of these, we posit 102 human proteins could potentially be both (i) associated with COVID-19 related clinical pathologies based on their known function and (ii) targeted by anti-SARS-CoV-2 antibodies. The data sets we have generated using novel computational methods present testable hypotheses to elucidate molecular mechanisms that could explain the complex multi-system disorders associated with COVID-19.