African Americans and European Americans exhibit distinct gene expression patterns across tissues and tumors that are associated with immunologic and infectious functions and environmental exposures

The COVID-19 pandemic has affected African American populations disproportionately with respect to prevalence, morbidity, and mortality. Because gene expression profiles represent combined genetic, socioenvironmental, and physiological effects, and could provide therapeutic biomarkers and environmental mitigation strategies, we undertook a large-scale assessment of differential gene expression between African Americans and European Americans. To do this, we mined RNA-Seq datasets from normal and diseased (tumor) conditions whose metadata could be used to evaluate differential patterns. We observed widespread differential expression of genes implicated in COVID-19 and integral to epithelial boundary function, inflammation, infection, and reactive oxygen stress. Notably, expression of the little-studied F8A2 gene is up to 40-fold greater in African Americans. F8A2, like F8A1, encodes HAP40 protein, which mediates early endosome movement. African American gene expression signatures reveal increased number or activity of esophageal glandular cells and lung ACE2-positive basal keratinocytes. These findings have potential to establish prognostic signatures, refine approaches to minimizing risk of severe infection, and improve precision treatment of COVID-19.


Introduction
The COVID-19 pandemic has infected over 31 million people and killed over 970,000 worldwide as of September 2020 (https://coronavirus.jhu.edu/map.html). Its causative agent, the novel SARS-CoV-2, is an enveloped single-stranded RNA virus that infects tissues including epithelial cells in the upper respiratory tract, lung alveoli, GI tract, vascular endothelium, renal tubules, central nervous system, and myocardium [1][2][3][4][5][6] . The complex combinations and severities of symptoms caused by SARS-CoV-2 include fever, cough, fatigue, dyspnea, diarrhea, thrombosis, stroke, acute respiratory failure, renal failure, and cardiac failure; in some individuals these may lead to long-term disability or death 2,5,6 . Differing patterns of disease may result from direct cellular infection, secondary inflammatory repercussions, and circulating immune and necrotic complexes from distal sites of infection and response [7][8][9] . How these attributes confer risk of increased disease severity to individuals is not well understood 4,8,10,11 . Identifying the individuals most at risk for severe COVID-19 infection, and determining the molecular and physiological basis for this risk, is critical to enabling more informed public health decisions and to improving our identification and use of precision interventions.
COVID-19 cases and deaths are disproportionately higher among African Americans in the US 10 . This disparity is caused in part by complex combinations of socio-economic factors, including underlying comorbidities, air quality, population density, and health care access 10 . Heritable factors in the human host also influence COVID-19 symptoms [12][13][14][15][16] . To date, several genetic determinants of COVID-19 severity have been partially elucidated. Genetic variants of Angiotensin-Converting Enzyme2 (ACE2), a major human host receptor for the SARS-CoV-2 spike protein, may be linked to increased infection by COVID-19 15 . Human Leukocyte Antigen (HLA) gene alleles have been associated with susceptibility to diabetes and SARS-CoV-2 14 . A

Results
To identify genes differentially expressed (DE) between African Americans and European Americans, we constructed an aggregated dataset of 7,142 RNA-Seq samples encompassing nine non-diseased tissues from GTEx and eight cancers from TCGA 18,19 . Race assignments are self-reported in the metadata; however, many of the sampled individuals who identify as a single race may be from an admixed population 20,21 . We analyzed data and metadata using MetaOmGraph (MOG) 18 , software that supports interactive exploratory analysis of large datasets to identify and distinguish patterns across multiple dimensions (Table 1 and Supplementary Table S1).

Genes are DE between populations in a tissue-and tumor-specific manner
DE genes were identified for each sample-type, as well as for pooled TCGA and GTEx data (Table 1 and Supplementary Tables S2-S28). The number of samples affects the power of the DE test (Table 1). To test for potential confounding factors that might explain differences in gene expression patterns, we compared African American and European American populations while controlling for biologically relevant factors (sex, age, tissue, and cancer sub-type); under these analyses, all DE genes retained statistical significance (Supplementary Tables S29-S55). We applied Hartigans' dip test to each gene to evaluate bi- or multi-modality in gene expression distributions, which may imply the presence of hidden variables that affect expression of that gene (Additional File 1).

Differences in gene expression between populations are enriched for the inflammation/cytokines, endosomal development, and ROS metabolism network
GO terms related to the interrelated biological processes of inflammation/cytokines, endosomal development, and ROS metabolism are overrepresented among the genes that are DE between African Americans and European Americans (Supplementary Table S56). Similarly, Gene Set Enrichment Analysis (GSEA) of all 25 GTEx and TCGA sample-types shows that KEGG pathways of immune- and inflammation-related processes are highly enriched (Supplementary Table S57-S58); the single most commonly enriched pathway (found in 19 of the 25 sample-types) is "cytokine-cytokine receptor interaction" (Figure 1 and Additional File 2). The processes of oxidative metabolism/xenobiotic catabolism are enriched in 9 sample-types (Figure 1 and Additional File 2). Pooled GTEx data reveal coordinated changes between African Americans and European Americans associated with four cytokine-related pathways and oxidative drug metabolism (Figure 1 and Supplementary Table S57).
The seven genes most highly and consistently DE between African Americans and European Americans across all sample-types are: C-C Motif Chemokine Ligand CCL3L3; mitochondrial Glutathione-S-Transferase GSTM1; Nuclear Pore Complex Interacting Protein Family Member NPIPB15; Coagulation Factor VIII Associated genes F8A3 and F8A2; FAM21B; and serine protease PRSS21. Of these, four are directly related to inflammation, endosomal development, and ROS metabolism. Each of these genes is also DE in multiple tissues and cancers (Supplementary Table S2-S25). The small inducible chemokine, CCL3L3, is more highly expressed in African Americans by up to 7-fold in most diseased and non-diseased sample-types (Figure 2; Supplementary Table S2-S25). Several genes that mitigate oxidative stress are DE between African American and European American populations. In particular, GSTM1, a key enzyme involved in oxidative stress, is more highly expressed in African Americans than European Americans across multiple sample-types (Figure 2).

Figure 1. ... European Americans in pooled GTEx data. GSEA comprehensively analyzes expression data for all genes, rather than only the DE genes. A. The most common pathways enriched among upregulated genes in African Americans for tissue-types in GTEx. The complete list of enriched pathways and sample-types is in Additional File 2. CK-CK, cytokine-cytokine receptor interaction; glutathione-oxidative metabolism includes (oxidative) metabolism of xenobiotics. The full enrichment analysis for each sample-type is shown in Supplementary Table S57-S59. B. The five most highly enriched pathways among upregulated genes of pooled samples from all sample-types in GTEx are: Toll-like receptor signaling; chemokine signaling; primary immunodeficiency; viral protein interaction with cytokine and cytokine receptor; metabolism of xenobiotics by cytochrome P450.
F8A1 is more highly expressed by about 2-fold in European Americans in almost every sample-type analyzed (Figure 3). Conversely, F8A2 and F8A3 are more highly expressed in African Americans (Supplementary Table S2-S25, Figure 3, and Supplementary Figure 1). Expression of F8A2 in African Americans is up to 40-fold greater; expression of F8A3 is up to 6.6-fold greater. In LUSC, F8A2 and F8A3 are the only genes DE > 2-fold (Supplementary Table S7).
Because of the vast differences in expression levels of the three HAP40-encoding genes between African Americans and European Americans, the paucity of literature on HAP40 25 , and the relationships among F8A1, F8A2, and F8A3 genes, we further investigated the sequences, sequence variants, and the expression patterns of these genes. The sequences of the HAP40 proteins encoded by F8A1, F8A2, and F8A3 are identical to each other in human reference genome GRCh38.p13 (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39). We searched for potential allele variants of HAP40 proteins encoded by F8A1, F8A2, and F8A3 in The Genome Aggregation Database (gnomAD) 26 . gnomAD assigns individuals to populations by clustering of genetic features. Our search identified only very rare sequence variants in the HAP40s encoded by F8A1, F8A2, or F8A3 (gnomAD v3). No structural variants were identified for HAP40 of F8A1 or F8A3; a duplication of 54 aa is, very rarely, present in F8A2 (gnomAD SVs v2.1).
To our knowledge, F8A1, F8A2 and F8A3 gene expression has never been compared. This may be because expression of F8A2 and F8A3 genes is relatively low in most European Americans, and European Americans are the predominant population studied. Furthermore, most RNA-Seq studies report expression of only F8A1 or F8A3 (and not F8A2), presumably aligning all ...

Figure 2. Upregulated expression of CCL3L3 chemokine and mitochondrial glutathione-S-transferase GSTM1 in African Americans compared to European Americans across multiple conditions. A. CCL3L3 is more highly expressed in African Americans over a wide range of sample-types. CCL3L3 binds to chemokine receptor proteins including CCR1, CCR3, and CCR5. B. GSTM1 is more highly expressed in African Americans over a wide range of sample-types. GSTM1 is a key player in metabolism of ROS and xenobiotics. (See Supplementary Tables S2-S28 for complete DE analysis). Violin plots summarize expression over each sample across the two populations. AA, African American; EA, European American. Horizontal lines represent mean log expression. *, MW test for DE significant (BH corrected p-value < 0.05). *, Hartigans' dip test, which assesses bimodal distribution potentially corresponding to hidden covariates. For a given gene and sample-type, a bimodal structure could imply presence of underlying hidden variables that affect expression of that gene, such as unreported sub-population structure or other environmental/genetic factors affecting gene expression in that multi-cellular organism. Expression distribution is also influenced by differences in population sizes (significant p-value < 0.05). FC, fold change AA/EA. GTEx and TCGA violin plots represent the pooled samples from each project. DE were computed within MetaOmGraph (MOG) 18 , in MOG's statistical analysis module; R scripts were executed interactively via MOG to generate the violin plots.

Figure 3. ... 23 . A. F8A1 expression is upregulated in European Americans. B. F8A2 expression is upregulated in African Americans. Violin plots summarize expression over each sample across the two populations. AA, African American; EA, European American. Horizontal lines represent mean log expression. *, MW test for DE significant (BH corrected p-value < 0.05). *, Hartigans' dip test, which assesses bimodal distribution potentially corresponding to hidden covariates. For a given gene and sample-type, a bimodal structure could imply presence of underlying hidden variables that affect expression of that gene, such as unreported sub-population structure or other environmental/genetic factors affecting gene expression in that multi-cellular organism. Expression distribution is also influenced by differences in population sizes (significant p-value < 0.05). FC, fold change AA/EA. GTEx and TCGA violin plots represent the pooled samples from each project. DE were computed within MetaOmGraph (MOG) 18 .

We analyzed coexpression of the three F8A genes relative to the other 18,212 genes represented in the full TCGA-GTEx dataset using two statistical measures: Pearson's correlation and Mutual Information (MI). Although the three F8A genes are proximately located on the X chromosome, their expression patterns are not positively correlated as a group. F8A2 and F8A3 have a Pearson's correlation of r = 0.40, while F8A1 is negatively correlated with F8A2 and F8A3. Of all 18,212 genes represented in the data, the expression pattern of F8A1 is most negatively (anti-) correlated with that of F8A2 (r = -0.45) and F8A3 (r = -0.24) (Supplementary Table S60). F8A1 expression is not correlated with the F8 (Coagulation Factor FVIII) gene, although it resides within intron 22 of this gene. MI analysis indicates that F8A2 and F8A3 genes are more closely associated with F8A1 than with any other gene, consistent with the negative Pearson correlation (Supplementary Table S60).
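The two coexpression measures can be illustrated with a minimal Python sketch. It is not MOG's implementation: the equal-width binning used for MI, the bin count, and the toy expression vectors are all assumptions made for illustration only.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def discretize(values, bins):
    """Assign each value to one of `bins` equal-width bins (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against constant vectors
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=3):
    """Mutual information (in bits) between two binned expression vectors."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(x)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum((c / n) * math.log2(c * n / (px[i] * py[j]))
               for (i, j), c in pxy.items())


# Hypothetical expression values for two perfectly anti-correlated genes.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

r = pearson(gene_a, gene_b)
mi = mutual_information(gene_a, gene_b)
```

MI is non-negative and ignores the sign of the relationship, which is why a strongly anti-correlated gene pair can still rank as closely associated under MI.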

Signatures of DE genes correspond to specific cell types in esophagus and lung
Because GTEx samples represent whole tissues, we sought to determine whether genes differentially expressed between African Americans and European Americans corresponded to distinct cell populations present in these tissues. This would provide information on cell-type representation across the two populations. To do this, we evaluated single cell datasets from two tissues highly relevant for SARS-CoV-2 infection: esophagus and lung 27,28 .
Genes upregulated in African Americans in the esophagus map predominantly to two cell lineages: glandular epithelial cells of esophageal glands, and hematolymphoid lineage-associated dendritic cells (Figure 4). In proximal and distal airway cells of the lung, the signature of DE genes in African Americans versus European Americans corresponds to basal differentiating and proliferating keratinocytes (Figure 5).

Discussion
Human genetics contribute to the propensity and severity of diseases 21,[30][31][32][33][34][35][36][37] . Sometimes the contribution is straightforward; a single allele variation found in Ashkenazi Jews causes the vast majority of Tay-Sachs disease 36 . Sometimes it is more complex; for example, hypertension is more prevalent in African American than European American populations 37 , in part due to detrimental APOL1 mutations that are more frequent in West African populations 31 . Despite the paucity of studies focused on West African populations, the propensity and severity of other diseases among this population have been attributed to genetics 21,31,38,39 .
Many COVID-19 deaths have been attributed to a cyclic over-excitement of the innate immune system 1,7,16 . This process, often termed a cytokine storm, results in a massive production of cytokines, and the body attacking itself rather than specifically destroying the pathogen-containing cells 1,7 . People with comorbidities, the elderly, and immunosuppressed individuals may be at a greater risk for COVID-19 morbidity and mortality either because they may not respond to infection with a sufficient immune response 40 and/or because they may be more likely to develop a cytokine storm 1,7 . Notably, many cytokines and other immunomodulatory molecules are DE between African Americans and European Americans in one or more sample-types, cytokine-related KEGG pathways are enriched for DE genes, and cell-type biomarkers indicate enrichment of DE genes in immune-related cell-types. Thus, predominant differences in gene expression, pathway enrichment, and cell-types between African Americans and European Americans are all implicated in biological processes that highly impact COVID-19 morbidity and mortality.
CCL3L3, upregulated in African Americans under almost every diseased and non-diseased sample-type we tested, is also upregulated in COVID-19-diseased human bronchoalveolar lavage fluid 41 . The CCL3 protein, encoded by CCL3L3, is a member of the functionally-diverse C-C motif chemokine family and acts as ligand for CCR1, CCR3, and CCR5. CCL3 is a neutrophil chemotaxis protein, recruiting and activating granulocytes, and inhibiting HIV-1-infection 42 . CCL3L3 is upregulated in COVID-19, and neutrophils themselves are highly implicated in COVID-19 severity 41,43 .
GSTM1, more highly expressed in African Americans compared to European Americans in almost every sample-type evaluated, is a key enzyme in mitochondrial ROS metabolism 44 . Mitochondrially generated ROS induce expression of proinflammatory cytokines and chemokines, and are considered to play a key role in modulating innate immune responses against RNA viruses 44 including SARS-CoV-2 45 . Higher expression of GSTM1 could lead to increased mitochondrial ROS, which might ultimately trigger inflammation and a cytokine storm 44 . Alternatively, higher GSTM1 expression might cause ROS to be metabolized rapidly, preventing ROS from initiating a sufficient immune response. GSTM1 has a second critical function: metabolism of xenobiotics, including many toxins and pharmaceuticals 44 . Thus, if GSTM1 is highly expressed, pharmaceuticals may be rapidly metabolized and rendered inactive.
The most dramatic differences in gene expression in African Americans compared to European Americans are associated with the highly conserved but little-studied F8A genes, which each encode HAP40. F8A1 is upregulated in European Americans, while F8A2 and F8A3 are upregulated in African Americans. HAP40 function has been researched mostly in the context of F8A1 and the critical role of that gene in slowing early endosome mobility in Huntington's disease 24 . In Huntington's, HAP40 forms a bridge between the huntingtin protein and the regulatory small guanosine triphosphatase, RAB5; formation of this complex reduces endosomal motility by shifting endosomal trafficking from the microtubule to the actin cytoskeleton 23 . F8A1 overexpression in striatal neuron cell lines from mice resulted in increased ROS and mitochondrial dysfunction 46 . Knockouts of F8A1 in human HeLa and HEK293 cells yield altered/reduced autophagy and shorter life spans 46 . Knockouts of the single F8A gene in Drosophila similarly show reduced activity, altered/reduced autophagy, and shorter lifespan 47 .

Figure 4. Esophageal gene signatures vary in African Americans and European Americans in a cell-specific manner. Gene signature upregulated in African American versus European American esophagus maps to two cell lineages with prominent presence in the human cell atlas (https://data.humancellatlas.org/) esophageal dataset. One significant fraction of the African American-upregulated gene signature maps to glandular mucous epithelial cells of esophageal glands (genes marked by red, far right bar). Expression of several of the genes upregulated in African Americans is highly restricted to the mucous epithelial cells (TSPAN8, PRR4, ELAPOR1), whereas FOLR1, for example, is more highly expressed in the ductal epithelial cells of the gland. A second, smaller, signature corresponds to hematolymphoid/myeloid lineage dendritic cells, as shown by CDC1C, PLD4, HERPUD1, and LPXN (genes marked by green, far right bar). In addition to a number of genes that are most strongly expressed by those cell types, there are several genes that are essentially exclusively expressed by those cells. ToppCell-constructed gene modules (http://toppcell.cchmc.org) for each of the cell types reported to be present in the large scRNA-Seq dataset from esophagus 29 .

Figure 5. Lung gene signatures upregulated in African Americans versus European Americans map to proximal airway keratinocytic epithelial lineage, and to mesenchymal mesothelial and neuroendocrine cells. Marker genes for keratinocytes (genes marked by yellow, far right bar); ciliated epithelial cells (genes marked by turquoise, far right bar); mesothelial mesenchymal cells (genes marked by red, far right bar); and neuroendocrine mesenchymal cells (mesenchymal). Note that the keratinocytic proximal basal epithelial cell is the cell subtype with the highest expression of ACE2 receptor, a major target of COVID-19 (ACE2 marked by black on bar at right). ToppCell-constructed gene modules (http://toppcell.cchmc.org) for each of the cell types reported to be present in the large scRNA-Seq dataset from lung 28 .
F8A1 expression is increased under several conditions, including Huntington's disease 48 , presence of a SNP variant for type 1 diabetes risk 49 , cytotrophoblast-enriched placental tissues in women with severe preeclampsia 50 , and mesenchymal bone marrow cells as women age 51 . Its potential role in the latter conditions has not been investigated.
Altered endosome motility would play an important but complex role in infection and the innate immune response, and might either promote or hinder the battle between SARS-CoV-2 and its human host 22,52 . Coronaviruses including SARS-CoV-2 mainly enter host cells via binding to the ACE2 receptor followed by endocytosis 7,53 . Nascent early endosomes are moved along the microtubule cytoskeleton, fusing with other vesicles; varied molecules can be incorporated into the endosomal membrane or its interior 22,52 . This regulated development enables diverse fates. For example, in the context of SARS-CoV-2, endosomes might release viral RNA or particles; they might merge with lysosomes and digest their viral cargo; or they might fuse with autophagosomes (autophagy) and subsequently with lysosomes that digest the cargo 22,52 . SARS-CoV-2 might reprogram cellular metabolism to suppress autophagy and promote viral replication 54 ; conversely, the cell might modify autophagy machinery to decorate viral invaders with ubiquitin for eventual destruction, activate the immune system by displaying parts of the virus, or catabolize excess pro-cytokines. Autophagy might induce cytokine signaling, which could promote protective immune response or engender a destructive storm of cytokines, inflammation and tissue damage 22 . Because of its function in early endosome motility, HAP40 has implications as a potential molecular target in therapy of endosomal and autophagy-related disorders such as COVID-19.
Our analyses using single cell reference data indicate several cell type-specific associations of the signatures of DE genes in African Americans versus European Americans. One model by which this might occur is that individuals of one population tend to have different proportions of a given cell type or histological structure. An alternative model is that individuals of one population might tend to maintain some of their cell types in a state of relatively higher activation. Either explanation would lead bulk RNA-Seq analyses, such as those of sample-types from GTEx or TCGA, to show elevated expression of those transcripts in that population.
Although major differences exist at a population level in expression of immunity-related genes and cell-type-specific associations between African Americans and European Americans, gene expression differences are more complex when considered on an individual basis. Individuals within a population may exhibit all, none, or some portion of these differences. That some genes show bimodal expression distribution in some sample-types in African American and/or European American populations further emphasizes this variation.
Thus, the significance of these patterns and their relationship to differential susceptibility or risk of severity from COVID-19 infection must be considered from nuanced perspectives. Importantly, it may be that only a fraction of the signature and a fraction of the individuals in a population are at elevated risk of more severe disease. In addition, different mechanisms of risk may be operative within different individuals within a population. For example, elevated abundance or activity of cells that are the target of COVID-19 (e.g., ACE2-positive basal keratinocytes) could lead to a greater infection burst during initial phases, with a larger number of virions being released systemically. If, as it appears from the alignment of the DE genes in African Americans compared to European Americans to the lung single cell data, this is the case for African American individuals, then they might be more readily taken over by infecting SARS-CoV-2 virions.
The differential expression of genes implicated in COVID-19 morbidity and mortality between African Americans and European Americans reported herein emphasizes the importance of integrating gene expression data into the factors considered in studying this pandemic at a population level. Further, RNA-Seq data has been shown useful in clinical practice for pediatric cancers 55 , and this practice could be extended to other diseases. We hypothesize, in concurrence with 17,56 , that processes of disease-and stress-related genes are overrepresented among the DE genes of African American and European American populations in part because the ancestral selection pressure due to disease and stresses (such as temperature and toxins) was very strong, with very different complements of pathogens and stresses in the regions where these populations lived. To survive, humans living in Europe and those living in Western Africa would have had to evolve the ability to resist the diverse prevalent local pathogens and stresses. Other differences would be due to a difference in socioeconomic environment 57 .
The utility of expression data is tremendous, but it is reliant on adequate representation of cohorts and on sufficient metadata. For example, ethnic bias and practical factors (such as subject availability) often result in insufficient numbers of subjects from many populations being represented in medical studies 58,59 ; this lack of representation impedes the development of precision prognosis and therapy based on genetics 34,58 . Here, we were limited to comparison of differences between gene expression in African American and European American populations because even in the large GTEx and TCGA studies, sample sizes for the other three major population groups (Asian, Native American, and Pacific Islanders) were generally too low for robust statistical assessment (Supplementary Table S1).
In addition, even if sample sizes for race are sufficient, information on the ancestry of each individual sampled is needed. Self-reported metadata on race is often not publicly available for individual samples. However, methods of assigning ancestry to individuals sampled for RNA-Seq are being developed and applied 60,61 .
A major dichotomy exists between socio-economic and genomic investigations. Among the vast body of human RNA-Seq data deposited, not only are metadata on the ancestry of the sampled individuals often unavailable, but socio-economic metadata (postal code, education, income, occupation) are almost never present. Thus, apart from the pioneering sociogenomics research of 57,62 and the studies of 56,63,64 , socioeconomic information is rarely considered in 'omics analyses. Indeed, because of the scant metadata on socioeconomic determinants, it is not even possible to determine possible skewness of representation of socioeconomic groups among the individuals sampled; thus, socioeconomic factors represent high-impact complex hidden covariates that would be challenging to model. Conversely, sociological studies rarely incorporate 'omics information. For example, the U.S.-based Robert Wood Johnson Foundation (https://www.rwjf.org/en/library/interactives/whereyouliveaffectshowlongyoulive.html) cites research that "your zip code can be more important than your genetic code" for your health; however, the analyses were done without actually evaluating genetic codes.
Because socioeconomic data were absent in this study, we were unable to distinguish genetic effects from socioeconomic causes. Despite having metadata on ancestry, we could not resolve the component due to genetics from that due to socioeconomic factors, and were limited to reporting population-based differences in gene expression (rather than ancestry-based differences). Privacy concerns need to be carefully balanced against the very real health benefits that can be gained from metadata access. Without routine inclusion and availability of diverse metadata for human 'omics samples, data mining is hampered, and important information is lost.
In summary, multiple genes implicated in COVID-19 immunity and inflammation are DE across African American and European American populations. This differential expression is evident despite the fact that race is self-reported in the metadata, and many Americans are racially admixed 21 . By highlighting the wide-ranging differences in expression of genes implicated in the morbidity and mortality of COVID-19 across populations, and by revealing apparent cell-type differences between populations, we provide a baseline for future study and emphasize the importance of harvesting these types of information for medicine. Such research will establish prognostic signatures with vast implications for precision treatment of diseases such as COVID-19.

Datasets
GTEx provides data representing "non-diseased" samples from diverse tissues. Non-diseased refers to the tissue itself; however, in some cases the individual sampled was postmortem and the causes of death are varied. The TCGA project is the largest project available on different diseased samples (tumors) of multiple tissue origins. Both projects have metadata on the (self-reported) races of the individuals who contributed samples. These two projects provide a unique opportunity to evaluate differences in gene expression across populations in multiple sample-types that vary by tissue and disease status. Tissues and cancers were selected for downstream analysis based largely on having sufficient numbers of individuals from each ancestry. We refer to those self-reporting as "Black or African American" as "African Americans" and "White" as "European Americans".
The data files and the precompiled MOG project, MOG_HumanCancerRNASeqProject, were downloaded from http://metnetweb.gdcb.iastate.edu/MetNet_MetaOmGraph.htm 18 . This project uses batch-corrected and processed data to enable comparison across samples 19 . MOG_HumanCancerRNASeqProject contains expression values for 18,212 genes, 30 fields of metadata detailing each gene, across 7,142 samples representing 14 different cancer types and associated non-tumor tissues (TCGA and GTEx samples) integrated with 23 fields of metadata describing each study and sample 18 .

Statistical and correlation analyses
The MOG tool was used to interactively explore, visualize and perform differential expression and correlation analysis of genes.
The Mann-Whitney (MW) test was used to identify DE genes between two groups; we chose this non-parametric analysis as it makes no assumptions about the data distribution. We define a gene as DE 2-fold or more between two groups if it meets each of the following criteria: 1. Estimated fold-change in expression of 2-fold or more (log fold change, |logFC| ≥ 1, where logFC is calculated as in limma 65 ).
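A DE call of this kind can be sketched in plain Python as below. This is an illustrative sketch, not the MOG implementation. Only the fold-change criterion is listed above; the second filter used here (a Benjamini-Hochberg corrected MW p-value below 0.05) is an assumption taken from the figure legends. The sketch also assumes that inputs are log2 expression values, that logFC is the difference of group means on that scale, and that a normal approximation to the MW statistic is acceptable. The gene names and values are hypothetical.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def mann_whitney_p(a, b):
    """Two-sided MW p-value (normal approximation, tie and continuity corrected)."""
    n1, n2 = len(a), len(b)
    pooled = list(a) + list(b)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = max(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))), 0.0)
    if variance == 0:
        return 1.0
    z = max(abs(u1 - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(variance)
    return math.erfc(z / math.sqrt(2))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_de(log2_expr, alpha=0.05, min_abs_logfc=1.0):
    """Return {gene: (logFC, adjusted p)} for genes passing both filters."""
    genes = list(log2_expr)
    adjusted = benjamini_hochberg([mann_whitney_p(*log2_expr[g]) for g in genes])
    de = {}
    for gene, q in zip(genes, adjusted):
        aa, ea = log2_expr[gene]
        logfc = sum(aa) / len(aa) - sum(ea) / len(ea)  # AA minus EA, log2 scale
        if abs(logfc) >= min_abs_logfc and q < alpha:
            de[gene] = (logfc, q)
    return de


# Hypothetical log2 expression values: (African American, European American).
toy = {
    "G1": ([5.1, 5.3, 5.2, 5.4, 5.0, 5.5, 5.6, 5.25],
           [3.0, 3.2, 3.1, 2.9, 3.3, 3.4, 2.8, 3.05]),
    "G2": ([4.0, 4.2, 4.4, 4.6, 4.8, 5.0, 5.2, 5.4],
           [4.1, 4.3, 4.5, 4.7, 4.9, 5.1, 5.3, 5.5]),
}
de_genes = call_de(toy)
```

In this toy input only G1 passes both filters; G2 has overlapping distributions and a log fold change near zero.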

Covariate evaluation
To check for potential sampling differences between populations that might confound the analysis, we fit linear models using limma 65 in R to adjust for biologically relevant, potential confounding factors of race, gender, tissue/tumor type, age, and cancer subtype (Supplementary Table S29-S55). Because ratios of cancer sub-types may differ between races (as reported for breast cancer in premenopausal African American women), we evaluated the RNA-Seq data from African Americans and European Americans in BRCA samples for potential confounding effects due to different ratios of four breast cancer subtypes: basal-like (BAS), human epidermal growth factor receptor-2 positive/estrogen receptor negative (Her2), luminal A (LumA), and luminal B (LumB); all genes DE with a >2-fold change in MW analysis retained statistical significance in limma analysis of BRCA data, although the fold-change level varied (Supplementary Table S41).

Gene expression enrichment
Overrepresentation of biological processes and other functional analysis was assessed at https://toppgene.cchmc.org/. Geneset enrichment analyses (GSEA) were performed using the clusterProfiler library in R 66 .
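Over-representation of a gene set among DE genes is commonly scored with a one-sided hypergeometric test. The sketch below shows that calculation under stated assumptions: it is a generic illustration, not a description of the exact scoring or multiple-testing correction applied by ToppGene or clusterProfiler, and the function name and counts are hypothetical.

```python
from math import comb


def overrepresentation_p(n_universe, n_in_set, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn without replacement
    from a universe of n_universe genes, n_in_set of which are annotated
    to the gene set (one-sided hypergeometric tail)."""
    upper = min(n_in_set, n_de)
    tail = sum(comb(n_in_set, k) * comb(n_universe - n_in_set, n_de - k)
               for k in range(n_overlap, upper + 1))
    return tail / comb(n_universe, n_de)


# Hypothetical counts: 10 genes measured, 3 annotated to a pathway,
# 3 genes called DE, all 3 of them in the pathway.
p_all_three = overrepresentation_p(10, 3, 3, 3)
p_any = overrepresentation_p(10, 3, 3, 0)
```

With a realistic universe (for example the 18,212 genes in the project dataset) the same function applies unchanged, because Python integers do not overflow; the resulting p-values would still need correction across all gene sets tested.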

Cell-type analysis
African American-vs-European American gene signatures were mapped to cell types and compartments using cell type-specific gene modules derived from a series of single cell gene expression datasets for esophagus 29 and lung 28 hosted in ToppCell (http://toppcell.cchmc.org/), using the ToppGene tool (http://toppgene.cchmc.org/). Heat map visualization of genes differentially expressed by African Americans versus European Americans in each cell type module in the selected tissues was done in Morpheus (https://software.broadinstitute.org/morpheus/) with ToppCell's "super binned" gene expression for each cell type within each single cell dataset.

Data availability
We subscribe to an open data model (https://www.go-fair.org/fair-principles/). MOG is free and open source software published under the MIT License. MOG software, user guide, and the MOG_HumanCancerRNASeqProject project datasets and metadata described in this article are freely downloadable from http://metnetweb.gdcb.iastate.edu/MetNet_MetaOmGraph.htm. MOG's source code is available at https://github.com/urmi-21/MetaOmGraph/. Detailed information and code on how to reproduce the results, along with Additional files, are available at https://github.com/urmi-21/COVID-DEA.

Funding
This work is funded in part by the National Science Foundation award IOS 1546858, "Orphan Genes: An Untapped Genetic Reservoir of Novel Traits" and by the Center for Metabolic Biology, Iowa State University. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. In particular, it used the Bridges HPC environment through allocations TG-MCB190098 and TG-MCB200123 awarded from XSEDE and the HPC Consortium.